BRIEF INTRODUCTION

We have chosen several datasets covering all players in the Top 5 European Leagues, measuring their performance in their domestic league during the 2022-2023 season. In this project we will analyse these datasets to obtain insights about these football players, which could then be used in scouting, training or performance improvement, among many other related applications.

1. SET UP

rm(list = ls())

# For reproducibility purposes
set.seed(123)

# List of required libraries
required_libraries <- c("dplyr", "skimr", "ggplot2", "ggrepel", "stringi", "GGally", "factoextra", "corrplot", "FactoMineR", "psych", "GPArotation", "reshape2", "tidyr", "patchwork", "ellipse", "RColorBrewer", "cluster", "mclust", "igraph", "pheatmap", "caret", "plotly", "kernlab")

# Install and load libraries
for (lib in required_libraries) {
  if (!require(lib, character.only = TRUE)) {
    install.packages(lib, dependencies = TRUE)
    library(lib, character.only = TRUE)
  }
}

We have several datasets, each containing different statistics for the players, which we will need to combine into a single richer and more useful dataset.

# Reading the different datasets
defense_stats <- read.csv("Big5-defenseStats22-23.csv", sep = ",")
gca_stats <- read.csv("Big5-gcaStats22-23.csv", sep = ",")
misc_stats <- read.csv("Big5-miscStats22-23.csv", sep = ",")
passing_stats <- read.csv("Big5-passingStats22-23.csv", sep = ",")
possession_stats <- read.csv("Big5-possessionStats22-23.csv", sep = ",")
shooting_stats <- read.csv("Big5-shootingStats22-23.csv", sep = ",")

# Datasets for goalkeepers
keepers_one <- read.csv("Big5-keepersStats22-23.csv", sep = ",")

# Dataset for all combined players
playingtime <- read.csv("Big5-playingtimeStats22-23.csv", sep = ",")

2. DATA PREPROCESSING

We want to combine all these datasets into a single one, but doing so naively would leave us with a very high number of dimensions. We will not handle duplicates or NA values just yet; instead, we will first focus on reducing the number of dimensions in each dataset.

Therefore, we will examine each dataset independently, removing unimportant columns, columns that can be inferred from other datasets, and columns that are linear combinations of other columns; all of these steps help reduce dimensionality. We will also rename the variables for a better understanding and clearer explanations when visualizing the data.

# Defense Stats

# The total number of tackles (Tkl) is the sum of Def.3rd, Mid.3rd and Att.3rd, which just indicate where on the pitch those tackles were made, so the three location columns can be dropped. Tkl.Int is the sum of tackles and interceptions, so we can keep it and eliminate Int. Blocks is the sum of Sh and Pass, which are the different types of block. Tkl.1 (tackles against dribblers) is already reflected in Tkl., the percentage of dribblers tackled, which is derived from Att and Lost, so those three columns can go too. Many players don't commit errors, and errors leading to a goal are mostly made by defenders, since a mistake near one's own goal is more critical than one elsewhere on the field, so we will eliminate Err as well.

defense_cols <- c("Def.3rd","Mid.3rd","Att.3rd","Tkl.1","Int","Sh", "Pass", "Att", "Lost","Err")
defense_stats <- defense_stats %>% select(-all_of(defense_cols))
defense_stats <- defense_stats %>% 
  rename(TacklesDone = Tkl, SuccessfulTackles = TklW, TacklesANDIntercep = Tkl.Int,
         Clearances = Clr, PercDribblersTackled = Tkl.)

# GCA stats

# SCA (Shot-Creating Actions) is an advanced metric counting the two offensive/attacking actions that directly lead to a shot. Following the same logic, GCA (Goal-Creating Actions) counts the two offensive/attacking actions that directly lead to a goal; the only difference is that one ends in a goal, the other in a scoring chance.
# SCA is the sum of PassLive, PassDead, TO, Sh, Fld and Def, the different action types that contribute to it, and SCA90 can be inferred by dividing SCA by the number of 90s played.
# GCA and GCA90 follow the same reasoning.

gca_cols <- c("SCA90","PassLive", "PassDead", "TO", "Sh", "Fld", "Def", "GCA90", "PassLive.1", "PassDead.1", "TO.1", "Sh.1", "Fld.1", "Def.1")
gca_stats <- gca_stats %>% select(-all_of(gca_cols))
gca_stats <- gca_stats %>% 
  rename(ShotCreatingActions = SCA, GoalCreatingActions = GCA )
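Relationships like this can be checked before dropping columns; a minimal sketch on invented toy data (column names mirror the real dataset, the values are made up):

```r
# Toy data mimicking the SCA breakdown (illustrative values, not real stats)
toy <- data.frame(PassLive = c(10, 4), PassDead = c(2, 1), TO = c(1, 0),
                  Sh = c(3, 2), Fld = c(1, 1), Def = c(0, 1))
toy$SCA <- c(17, 9)  # as it would come from the source file

# Confirm SCA is an exact sum of its component columns before removing them
all(toy$SCA == rowSums(toy[, c("PassLive", "PassDead", "TO", "Sh", "Fld", "Def")]))
```

On the real data, a FALSE here would signal that the column is not purely redundant and should be kept.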

# Misc stats

# X2CrdY counts sendings-off for two yellow cards rather than a straight red, and is already included in the red-card statistic. TklW and Int are already in the defense dataset. Won. is the percentage of aerial duels won, derived from Won and Lost, so those two can be removed. Own goals are often unfortunate accidents and do not reflect a player's skill. PKwon and PKcon count penalties won and conceded, but few players have a value here, so we remove them as irrelevant. Crs refers to crosses and will appear in a later dataset.

misc_cols <- c("X2CrdY", "TklW", "Int", "Won", "Lost", "OG", "PKwon", "PKcon", "Crs")
misc_stats <- misc_stats %>% select(-all_of(misc_cols))
misc_stats <- misc_stats %>% 
  rename(YellowCards = CrdY, RedCards = CrdR, FoulsCommited = Fls, FoulsDrawn = Fld, Offsides = Off, RecoveredBalls = Recov, PercAerialDuelsWon = Won.)

# Passing Stats

# Cmp. is the percentage of completed passes out of attempted ones, so we can remove it. Cmp.1, Att.1, Cmp..1, Cmp.2, Att.2, Cmp..2, Cmp.3, Att.3 and Cmp..3 all break passes down by distance, which is not of much interest right now, and TotDist and PrgDist already carry most of the distance information. X1.3, PPA, CrsPA and PrgP are also covered, as they all refer to progressive passing. A.xAG is assists minus xAG, so we can remove it.

passing_cols <- c("Cmp.","Cmp.1", "Att.1", "Cmp..1","Cmp.2","Att.2", "Cmp..2", "Cmp.3","Att.3","Cmp..3","X1.3", "PPA", "CrsPA", "PrgP","A.xAG")
passing_stats <- passing_stats %>% select(-all_of(passing_cols))
passing_stats <- passing_stats %>% 
  rename(CompletedPasses = Cmp, AttemptedPasses = Att, PassingDistance = TotDist, ForwardPassingDistance = PrgDist, Assists = Ast, ExpectedAssistsGoal = xAG, ExpectedAssists = xA,  KeyPasses = KP)

# Possession Stats

# Touches is the sum of Def.Pen, Def.3rd, Mid.3rd, Att.3rd, Att.Pen and Live, so those can be removed. Succ. is Succ divided by Att, i.e. the number of times the player dribbled past a defender divided by the total attempts. Tkld and Tkld. are directly related to the previous variables, so we take them out as well. X1.3, CPA and Mis are not particularly valuable. TotDist and PrgDist will be given different names to distinguish them from the passing dataset, which used the same variable names. Rec and PrgR count the times the player received a pass, but a missed pass is usually the passer's fault rather than the receiver's.


posession_cols <- c("Def.Pen", "Def.3rd", "Mid.3rd", "Att.3rd", "Att.Pen", "Live", "Att", "Succ", "Tkld", "Tkld.","X1.3", "CPA", "Mis", "Rec", "PrgR", "PrgC")
possession_stats <- possession_stats %>% select(-all_of(posession_cols))
possession_stats <- possession_stats %>% 
  rename(DistanceRanBall = TotDist, ForwardDistanceRanBall = PrgDist, PercSuccessfulDribbles = Succ., Dispossessed = Dis)

# Shooting Stats

# With goals, shots and shots on target we can infer SoT., Sh.90, SoT.90, G.Sh and G.SoT. Since not many players take penalties or free kicks, FK, PK and PKatt will be removed. npxG.Sh can be calculated from npxG and Sh, and G.xG and np.G.xG can likewise be derived from the goals, xG and npxG columns.

shooting_cols <- c("SoT.", "Sh.90", "SoT.90", "G.Sh", "G.SoT","FK", "PK","PKatt", "npxG.Sh", "G.xG", "np.G.xG")
shooting_stats <- shooting_stats %>% select(-all_of(shooting_cols))
shooting_stats <- shooting_stats %>% 
  rename(Goals = Gls, Shots = Sh, ShotsOnTarget = SoT, ShotDistance = Dist, ExpectedGoals = xG, NonPenaltyExpectedGoals = npxG)

# Keepers one

# The statistics MP, Starts and Min will appear in another dataset, so we remove them. Save. is calculated as (SoTA - GA) / SoTA. CS., the clean-sheet percentage, can also be removed. Everything about free kicks and penalties can be dropped, since scoring from a penalty is easy, from a free kick harder, and the outcome mostly depends on the quality of the taker; the keeper can often do little. We will remove PKatt, PKA, PKsv, PKm and Save..1. GA90 can also be inferred from goals allowed and the number of 90s played.

keepers_cols <- c("MP", "Starts", "Min", "Save.", "CS.", "PKatt", "PKA", "PKsv","PKm","Save..1", "GA90")
keepers_one <- keepers_one %>% select(-all_of(keepers_cols))
keepers_one <- keepers_one %>% 
  rename(GoalsAllowed = GA, ShotsOnTargetAgainst = SoTA, Won = W, Drawn = D, Lost = L, CleanSheets = CS)
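The save-percentage relationship mentioned in the comment can be sanity-checked with illustrative numbers (values invented for the example):

```r
# Hypothetical keeper line: 100 shots on target against, 20 goals allowed
SoTA <- 100
GA <- 20

# Save. as described above: saves divided by shots on target faced
save_pct <- (SoTA - GA) / SoTA * 100
save_pct  # 80: recoverable from GA and SoTA, so Save. can be dropped
```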



# Playing Time

# Min can be calculated as 90 * X90s, Mn.MP as Min / MP, and Mn.Start as Min / Starts, so all of them can be derived. Compl is not really important, and the same goes for the substitution variables, since the minutes already tell us how much each player played; we will remove Subs, Mn.Sub and unSub. X... is the difference between onG and onGA, so we can remove it, and likewise its per-90 version X...90, as well as On.Off and On.Off.1 and the xG equivalents xG... and xG...90.

playing_cols <- c("Mn.MP","Min.","Min", "Mn.Start", "Compl", "Subs", "Mn.Sub", "unSub","X...", "X...90", "On.Off", "On.Off.1", "xG...", "xG...90")
playingtime <- playingtime %>% select(-all_of(playing_cols))
playingtime <- playingtime %>% 
  rename(MatchesPlayed = MP, PointsPerMatch = PPM, GoalsByTeam = onG, GoalsAgainstTeam = onGA, ExepectedGoalsByTeam = onxG, ExpectedGoalsAgainstTeam = onxGA )

We now merge all the datasets, leaving missing values and duplicates for later. This can be done with inner joins, taking into account that all datasets share the same first 8 columns.

# Making a list with the repeated columns
repeated_columns = c("Player", "Nation", "Pos", "Squad", "Comp", "Age", "Born", "X90s")


# Inner joining all datasets based on key columns and a brief look at it
merged_players <- inner_join(defense_stats, gca_stats, by = repeated_columns)%>%
                  inner_join(misc_stats, by = repeated_columns)%>%
                  inner_join(passing_stats, by = repeated_columns)%>%
                  inner_join(possession_stats, by = repeated_columns)%>%
                  inner_join(shooting_stats, by = repeated_columns)

skim(merged_players)
Data summary
Name merged_players
Number of rows 2889
Number of columns 43
_______________________
Column type frequency:
character 5
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Player 0 1 3 30 0 2715 0
Nation 0 1 0 7 4 105 0
Pos 0 1 2 5 0 10 0
Squad 0 1 4 15 0 98 0
Comp 0 1 10 18 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1.00 25.26 4.62 15.0 22.00 25.0 29.00 41.0 ▃▇▆▂▁
Born 0 1.00 1996.39 4.62 1981.0 1993.00 1997.0 2000.00 2007.0 ▁▃▇▇▃
X90s 0 1.00 13.87 10.93 0.0 3.40 12.3 22.60 38.0 ▇▅▃▃▂
TacklesDone 17 0.99 21.01 21.54 0.0 3.00 14.0 33.00 147.0 ▇▂▁▁▁
SuccessfulTackles 0 1.00 12.39 12.92 0.0 2.00 8.0 20.00 83.0 ▇▃▁▁▁
PercDribblersTackled 365 0.87 46.30 23.15 0.0 33.30 47.6 60.00 100.0 ▃▅▇▃▁
Blocks 17 0.99 14.21 13.79 0.0 2.00 11.0 23.00 99.0 ▇▃▁▁▁
TacklesANDIntercep 17 0.99 32.47 33.42 0.0 4.00 21.0 52.00 195.0 ▇▂▁▁▁
Clearances 17 0.99 23.29 31.40 0.0 3.00 11.0 30.00 215.0 ▇▁▁▁▁
ShotCreatingActions 17 0.99 27.54 30.82 0.0 4.00 17.0 41.00 219.0 ▇▂▁▁▁
GoalCreatingActions 17 0.99 2.90 3.98 0.0 0.00 1.0 4.00 35.0 ▇▁▁▁▁
YellowCards 0 1.00 2.62 2.73 0.0 0.00 2.0 4.00 14.0 ▇▃▂▁▁
RedCards 0 1.00 0.13 0.40 0.0 0.00 0.0 0.00 3.0 ▇▁▁▁▁
FoulsCommited 0 1.00 15.26 14.40 0.0 3.00 12.0 24.00 81.0 ▇▃▂▁▁
FoulsDrawn 0 1.00 14.51 15.74 0.0 2.00 10.0 21.00 122.0 ▇▂▁▁▁
Offsides 0 1.00 2.28 4.33 0.0 0.00 1.0 2.00 33.0 ▇▁▁▁▁
RecoveredBalls 17 0.99 67.88 62.71 0.0 14.00 50.0 107.00 336.0 ▇▃▂▁▁
PercAerialDuelsWon 248 0.91 47.85 22.41 0.0 33.30 48.4 60.00 100.0 ▂▅▇▃▂
CompletedPasses 17 0.99 487.88 481.24 0.0 94.00 333.5 756.00 2911.0 ▇▃▁▁▁
AttemptedPasses 17 0.99 616.64 574.42 0.0 127.00 452.0 963.00 3223.0 ▇▃▂▁▁
PassingDistance 17 0.99 8711.21 9060.69 0.0 1464.75 5623.0 13423.75 54839.0 ▇▂▁▁▁
ForwardPassingDistance 17 0.99 3176.30 3961.12 0.0 388.75 1671.0 4515.25 36334.0 ▇▁▁▁▁
Assists 0 1.00 1.18 1.91 0.0 0.00 0.0 2.00 16.0 ▇▁▁▁▁
ExpectedAssistsGoal 17 0.99 1.25 1.74 0.0 0.10 0.6 1.70 16.7 ▇▁▁▁▁
ExpectedAssists 17 0.99 1.17 1.61 0.0 0.10 0.6 1.60 16.2 ▇▁▁▁▁
KeyPasses 17 0.99 11.74 15.00 0.0 1.00 6.0 17.00 119.0 ▇▁▁▁▁
Touches 17 0.99 754.96 665.57 0.0 168.00 599.5 1190.00 3461.0 ▇▅▂▁▁
PercSuccessfulDribbles 436 0.85 47.39 22.70 0.0 35.70 46.2 57.10 100.0 ▂▅▇▂▂
Carries 17 0.99 449.86 419.97 0.0 99.00 343.0 689.00 2376.0 ▇▃▁▁▁
DistanceRanBall 17 0.99 2411.54 2294.45 0.0 512.50 1783.0 3706.00 14430.0 ▇▃▁▁▁
ForwardDistanceRanBall 17 0.99 1197.94 1206.50 0.0 220.75 836.5 1832.50 8259.0 ▇▂▁▁▁
Dispossessed 17 0.99 11.11 13.37 0.0 1.00 6.0 16.00 92.0 ▇▂▁▁▁
Goals 0 1.00 1.69 3.16 0.0 0.00 0.0 2.00 36.0 ▇▁▁▁▁
Shots 0 1.00 15.63 19.48 0.0 2.00 9.0 22.00 144.0 ▇▁▁▁▁
ShotsOnTarget 0 1.00 5.23 7.66 0.0 0.00 2.0 7.00 73.0 ▇▁▁▁▁
ShotDistance 539 0.81 17.50 5.25 2.8 13.80 17.1 20.90 37.5 ▁▇▇▂▁
ExpectedGoals 17 0.99 1.76 2.86 0.0 0.10 0.7 2.20 28.4 ▇▁▁▁▁
NonPenaltyExpectedGoals 17 0.99 1.60 2.48 0.0 0.10 0.7 2.00 23.5 ▇▁▁▁▁

We now face the problem that goalkeepers have different statistics from outfield players, so the number of variables differs and we cannot join them directly. Instead, we add the columns missing from the goalkeeper dataset with a value of 0 for each entry, and vice versa.

# Obtaining all columns in players and goalkeepers
player_columns <- colnames(merged_players)
gk_columns <- colnames(keepers_one)

# Finding out the different columns in each of the datasets
missing_columns_players = setdiff(gk_columns, player_columns)
missing_columns_gk = setdiff(player_columns, gk_columns)

# Using a for loop to create the missing columns with a value of 0
for (col in missing_columns_players) {
  merged_players[[col]] <- 0
}

for (col in missing_columns_gk) {
  keepers_one[[col]] <- 0
}

# Combining both goalkeepers and players
all_players = bind_rows(merged_players,keepers_one)

# Merging with playing time stats
final_players <- inner_join(all_players, playingtime, by = repeated_columns)
head(final_players)
##             Player  Nation   Pos          Squad               Comp Age Born
## 1 Brenden Aaronson  us USA MF,FW   Leeds United eng Premier League  21 2000
## 2  Paxten Aaronson  us USA MF,DF Eint Frankfurt      de Bundesliga  18 2003
## 3   James Abankwah  ie IRL    DF        Udinese         it Serie A  18 2004
## 4    George Abbott eng ENG    MF      Tottenham eng Premier League  16 2005
## 5 Yunis Abdelhamid  ma MAR    DF          Reims         fr Ligue 1  34 1987
## 6    Himad Abdelli  dz ALG MF,FW         Angers         fr Ligue 1  22 1999
##   X90s TacklesDone SuccessfulTackles PercDribblersTackled Blocks
## 1 26.4          45                18                 32.6     43
## 2  1.9           6                 5                 66.7      2
## 3  0.7           1                 1                   NA      4
## 4  0.0           0                 0                   NA      0
## 5 37.0          82                50                 75.0     65
## 6 23.7          64                38                 46.3     30
##   TacklesANDIntercep Clearances ShotCreatingActions GoalCreatingActions
## 1                 50          6                  95                   7
## 2                  6          2                   8                   2
## 3                  1          5                   0                   0
## 4                  1          0                   0                   0
## 5                146        116                  43                   7
## 6                 92         15                  69                   3
##   YellowCards RedCards FoulsCommited FoulsDrawn Offsides RecoveredBalls
## 1           2        0            16         55        6            128
## 2           1        0             5          3        1              8
## 3           0        0             2          1        0              2
## 4           0        0             0          0        0              0
## 5           6        0            49         28        2            263
## 6           3        0            43         31        1            180
##   PercAerialDuelsWon CompletedPasses AttemptedPasses PassingDistance
## 1               17.0             592             797            7577
## 2               50.0              51              71             659
## 3              100.0              23              29             375
## 4                 NA               1               1               8
## 5               64.6            1679            2031           32967
## 6               43.9            1043            1257           18273
##   ForwardPassingDistance Assists ExpectedAssistsGoal ExpectedAssists KeyPasses
## 1                   2182       3                 4.2             2.6        46
## 2                    109       0                 0.0             0.1         1
## 3                     79       0                 0.0             0.0         0
## 4                      0       0                 0.0             0.0         0
## 5                  13407       2                 1.0             0.9        13
## 6                   4649       2                 2.9             2.5        36
##   Touches PercSuccessfulDribbles Carries DistanceRanBall ForwardDistanceRanBall
## 1    1143                   34.0     686            3646                   1532
## 2      99                   50.0      55             314                    143
## 3      39                  100.0      18              48                     19
## 4       2                     NA       0               0                      0
## 5    2459                   56.8    1723           10871                   6123
## 6    1516                   56.6    1184            6861                   3294
##   Dispossessed Goals Shots ShotsOnTarget ShotDistance ExpectedGoals
## 1           82     1    41             9         18.4           3.9
## 2            2     0     4             1         18.7           0.2
## 3            1     0     0             0           NA           0.0
## 4            0     0     0             0           NA           0.0
## 5           23     1    32             3         15.0           2.4
## 6           20     2    24             7         22.4           1.4
##   NonPenaltyExpectedGoals GoalsAllowed ShotsOnTargetAgainst Saves Won Drawn
## 1                     3.9            0                    0     0   0     0
## 2                     0.2            0                    0     0   0     0
## 3                     0.0            0                    0     0   0     0
## 4                     0.0            0                    0     0   0     0
## 5                     2.4            0                    0     0   0     0
## 6                     1.4            0                    0     0   0     0
##   Lost CleanSheets MatchesPlayed Starts PointsPerMatch GoalsByTeam
## 1    0           0            36     28           0.86          35
## 2    0           0             7      0           1.14           4
## 3    0           0             2      1           0.00           0
## 4    0           0             1      0           3.00           1
## 5    0           0            37     37           1.38          45
## 6    0           0            30     24           0.33          21
##   GoalsAgainstTeam ExepectedGoalsByTeam ExpectedGoalsAgainstTeam
## 1               58                 33.3                     49.8
## 2                0                  5.4                      1.4
## 3                0                  0.4                      0.7
## 4                0                  0.3                      0.0
## 5               42                 60.5                     50.1
## 6               49                 23.8                     40.7

We observe several inconsistencies in this new dataset: nation codes are repeated in both lower and upper case, some players have more than one position listed, and the competition names carry a lowercase country prefix, all of which could cause trouble later. We will clean these strings with gsub and substr, and keep only the first listed position for each player.

# Removing lower case in nation
final_players$Nation <- gsub("[a-z]", "", final_players$Nation)
# Saving only the first position
final_players$Pos <- substr(final_players$Pos,1,2)
# Removing lower case before leagues
final_players$Comp <- substr(final_players$Comp, 4, nchar(final_players$Comp))


# With that solved we will look at the players now to check for NAs and duplicated values
head(final_players)
##             Player Nation Pos          Squad            Comp Age Born X90s
## 1 Brenden Aaronson    USA  MF   Leeds United  Premier League  21 2000 26.4
## 2  Paxten Aaronson    USA  MF Eint Frankfurt      Bundesliga  18 2003  1.9
## 3   James Abankwah    IRL  DF        Udinese         Serie A  18 2004  0.7
## 4    George Abbott    ENG  MF      Tottenham  Premier League  16 2005  0.0
## 5 Yunis Abdelhamid    MAR  DF          Reims         Ligue 1  34 1987 37.0
## 6    Himad Abdelli    ALG  MF         Angers         Ligue 1  22 1999 23.7
##   TacklesDone SuccessfulTackles PercDribblersTackled Blocks TacklesANDIntercep
## 1          45                18                 32.6     43                 50
## 2           6                 5                 66.7      2                  6
## 3           1                 1                   NA      4                  1
## 4           0                 0                   NA      0                  1
## 5          82                50                 75.0     65                146
## 6          64                38                 46.3     30                 92
##   Clearances ShotCreatingActions GoalCreatingActions YellowCards RedCards
## 1          6                  95                   7           2        0
## 2          2                   8                   2           1        0
## 3          5                   0                   0           0        0
## 4          0                   0                   0           0        0
## 5        116                  43                   7           6        0
## 6         15                  69                   3           3        0
##   FoulsCommited FoulsDrawn Offsides RecoveredBalls PercAerialDuelsWon
## 1            16         55        6            128               17.0
## 2             5          3        1              8               50.0
## 3             2          1        0              2              100.0
## 4             0          0        0              0                 NA
## 5            49         28        2            263               64.6
## 6            43         31        1            180               43.9
##   CompletedPasses AttemptedPasses PassingDistance ForwardPassingDistance
## 1             592             797            7577                   2182
## 2              51              71             659                    109
## 3              23              29             375                     79
## 4               1               1               8                      0
## 5            1679            2031           32967                  13407
## 6            1043            1257           18273                   4649
##   Assists ExpectedAssistsGoal ExpectedAssists KeyPasses Touches
## 1       3                 4.2             2.6        46    1143
## 2       0                 0.0             0.1         1      99
## 3       0                 0.0             0.0         0      39
## 4       0                 0.0             0.0         0       2
## 5       2                 1.0             0.9        13    2459
## 6       2                 2.9             2.5        36    1516
##   PercSuccessfulDribbles Carries DistanceRanBall ForwardDistanceRanBall
## 1                   34.0     686            3646                   1532
## 2                   50.0      55             314                    143
## 3                  100.0      18              48                     19
## 4                     NA       0               0                      0
## 5                   56.8    1723           10871                   6123
## 6                   56.6    1184            6861                   3294
##   Dispossessed Goals Shots ShotsOnTarget ShotDistance ExpectedGoals
## 1           82     1    41             9         18.4           3.9
## 2            2     0     4             1         18.7           0.2
## 3            1     0     0             0           NA           0.0
## 4            0     0     0             0           NA           0.0
## 5           23     1    32             3         15.0           2.4
## 6           20     2    24             7         22.4           1.4
##   NonPenaltyExpectedGoals GoalsAllowed ShotsOnTargetAgainst Saves Won Drawn
## 1                     3.9            0                    0     0   0     0
## 2                     0.2            0                    0     0   0     0
## 3                     0.0            0                    0     0   0     0
## 4                     0.0            0                    0     0   0     0
## 5                     2.4            0                    0     0   0     0
## 6                     1.4            0                    0     0   0     0
##   Lost CleanSheets MatchesPlayed Starts PointsPerMatch GoalsByTeam
## 1    0           0            36     28           0.86          35
## 2    0           0             7      0           1.14           4
## 3    0           0             2      1           0.00           0
## 4    0           0             1      0           3.00           1
## 5    0           0            37     37           1.38          45
## 6    0           0            30     24           0.33          21
##   GoalsAgainstTeam ExepectedGoalsByTeam ExpectedGoalsAgainstTeam
## 1               58                 33.3                     49.8
## 2                0                  5.4                      1.4
## 3                0                  0.4                      0.7
## 4                0                  0.3                      0.0
## 5               42                 60.5                     50.1
## 6               49                 23.8                     40.7

2.1 Duplicate Values

Apart from that, we also find over 200 duplicated player entries, most likely from players who were transferred or loaned mid-season. To deal with this, we keep, for each player, the entry with the most time played.

# Removing duplicate players
final_players <- final_players %>%
  group_by(Player) %>%          
  # Keep the observation with the most time played
  slice_max(order_by = X90s, n = 1) %>%    
  ungroup()

# Check that there are no duplicates now
sum(duplicated(final_players))
## [1] 0
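As a quick illustration of the deduplication rule, slice_max keeps the row with the largest X90s per player; a toy sketch with invented entries:

```r
library(dplyr)

# Hypothetical mid-season transfer: the same player appears twice
toy <- data.frame(Player = c("A. Example", "A. Example", "B. Sample"),
                  X90s = c(10.2, 25.4, 30.0))

deduped <- toy %>%
  group_by(Player) %>%
  slice_max(order_by = X90s, n = 1) %>%
  ungroup()

nrow(deduped)  # 2: one row per player, keeping the larger X90s
```

Note that slice_max keeps ties by default; passing with_ties = FALSE would guarantee a single row per player even when X90s values are equal.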

2.2 NA Values

We use a barplot to inspect the proportion of NA values in each column.

barplot(colMeans(is.na(final_players)), las=2)

Since all NA values are in numerical columns, we can replace these NAs by the median of that column.

# Replace NA values with the median of each column
final_players <- final_players %>%
  mutate(across(everything(), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Number of rows with NA values 
nrow(final_players %>% filter(if_any(everything(), is.na)))
## [1] 0

This dataset already has many statistics per player, but we could also benefit from others that are not there, such as an overall player rating or market value. For this we will merge in part of another dataset, which takes its ratings and some other statistics from the FIFA video game.

# Reading complementary data and removing unnecessary columns
data_fifa <- read.csv("CLEAN_FIFA23_official_data.csv")
head(data_fifa)
##   X     ID            Name Age                                            Photo
## 1 0 209658     L. Goretzka  27 https://cdn.sofifa.net/players/209/658/23_60.png
## 2 1 212198 Bruno Fernandes  27 https://cdn.sofifa.net/players/212/198/23_60.png
## 3 2 224334        M. Acuña  30 https://cdn.sofifa.net/players/224/334/23_60.png
## 4 3 192985    K. De Bruyne  31 https://cdn.sofifa.net/players/192/985/23_60.png
## 5 4 224232      N. Barella  25 https://cdn.sofifa.net/players/224/232/23_60.png
## 6 5 212622      J. Kimmich  27 https://cdn.sofifa.net/players/212/622/23_60.png
##   Nationality                                Flag Overall Potential
## 1     Germany https://cdn.sofifa.net/flags/de.png      87        88
## 2    Portugal https://cdn.sofifa.net/flags/pt.png      86        87
## 3   Argentina https://cdn.sofifa.net/flags/ar.png      85        85
## 4     Belgium https://cdn.sofifa.net/flags/be.png      91        91
## 5       Italy https://cdn.sofifa.net/flags/it.png      86        89
## 6     Germany https://cdn.sofifa.net/flags/de.png      89        90
##                Club                               Club.Logo  Value... Wage...
## 1 FC Bayern München  https://cdn.sofifa.net/teams/21/30.png  91000000  115000
## 2 Manchester United  https://cdn.sofifa.net/teams/11/30.png  78500000  190000
## 3        Sevilla FC https://cdn.sofifa.net/teams/481/30.png  46500000   46000
## 4   Manchester City  https://cdn.sofifa.net/teams/10/30.png 107500000  350000
## 5             Inter  https://cdn.sofifa.net/teams/44/30.png  89500000  110000
## 6 FC Bayern München  https://cdn.sofifa.net/teams/21/30.png 105500000  130000
##   Special Preferred.Foot International.Reputation Weak.Foot Skill.Moves
## 1    2312          Right                        4         4           3
## 2    2305          Right                        3         3           4
## 3    2303           Left                        2         3           3
## 4    2303          Right                        4         5           4
## 5    2296          Right                        3         3           3
## 6    2283          Right                        4         4           3
##      Work.Rate        Body.Type Real.Face Position     Joined Loaned.From
## 1 High/ Medium           Unique       Yes      SUB 2018-07-01        None
## 2   High/ High           Unique       Yes      LCM 2020-01-30        None
## 3   High/ High Stocky (170-185)        No       LB 2020-09-14        None
## 4   High/ High           Unique       Yes      RCM 2015-08-30        None
## 5   High/ High    Normal (170-)       Yes      RCM 2020-09-01        None
## 6 High/ Medium Normal (170-185)       Yes      RDM 2015-07-01        None
##   Contract.Valid.Until Height.cm.. Weight.lbs.. Release.Clause... Kit.Number
## 1                 2026         189      180.810         157000000          8
## 2                 2026         179      152.145         155000000          8
## 3                 2024         172      152.145          97700000         19
## 4                 2025         181      154.350         198900000         17
## 5                 2026         172      149.940         154400000         23
## 6                 2025         177      165.375         182000000          6
##   Best.Overall.Rating Year_Joined
## 1                   0        2018
## 2                   0        2020
## 3                   0        2020
## 4                   0        2015
## 5                   0        2020
## 6                   0        2015
colnames(data_fifa)
##  [1] "X"                        "ID"                      
##  [3] "Name"                     "Age"                     
##  [5] "Photo"                    "Nationality"             
##  [7] "Flag"                     "Overall"                 
##  [9] "Potential"                "Club"                    
## [11] "Club.Logo"                "Value..."                
## [13] "Wage..."                  "Special"                 
## [15] "Preferred.Foot"           "International.Reputation"
## [17] "Weak.Foot"                "Skill.Moves"             
## [19] "Work.Rate"                "Body.Type"               
## [21] "Real.Face"                "Position"                
## [23] "Joined"                   "Loaned.From"             
## [25] "Contract.Valid.Until"     "Height.cm.."             
## [27] "Weight.lbs.."             "Release.Clause..."       
## [29] "Kit.Number"               "Best.Overall.Rating"     
## [31] "Year_Joined"
# Checking for duplicate values
sum(duplicated(data_fifa$Name))
## [1] 520

As with our previous football datasets, we have to deal with duplicate entries.

# Obtaining the data without repeated players
data_fifa <- data_fifa %>%
  group_by(Name) %>%
  # Keeping only observations with the most recent joined date
  slice_max(Joined, n = 1, with_ties = FALSE) %>%  
  ungroup()

# Check that there are no duplicates now
sum(duplicated(data_fifa))
## [1] 0

We can also see that the dataset is already clean, with no NA values. However, looking at the columns, some provide little information, duplicate variables already present in the previous dataset, or are not of interest for this project, so we will remove them. Since the merge will be done by name, that is the only shared column we keep. The player names in the FIFA dataset are also prefixed with numbers, which could cause errors when merging, so we remove those as well.

colnames(data_fifa)
##  [1] "X"                        "ID"                      
##  [3] "Name"                     "Age"                     
##  [5] "Photo"                    "Nationality"             
##  [7] "Flag"                     "Overall"                 
##  [9] "Potential"                "Club"                    
## [11] "Club.Logo"                "Value..."                
## [13] "Wage..."                  "Special"                 
## [15] "Preferred.Foot"           "International.Reputation"
## [17] "Weak.Foot"                "Skill.Moves"             
## [19] "Work.Rate"                "Body.Type"               
## [21] "Real.Face"                "Position"                
## [23] "Joined"                   "Loaned.From"             
## [25] "Contract.Valid.Until"     "Height.cm.."             
## [27] "Weight.lbs.."             "Release.Clause..."       
## [29] "Kit.Number"               "Best.Overall.Rating"     
## [31] "Year_Joined"
cols <- c("X", "ID", "Age", "Photo","Flag", "Club", "Club.Logo", "Special","Work.Rate","Skill.Moves", "Weak.Foot","Body.Type", "Real.Face", "Position", "Joined", "Loaned.From", "Contract.Valid.Until","Release.Clause...", "Kit.Number", "Best.Overall.Rating", "Year_Joined", "Potential", "Preferred.Foot", "International.Reputation", "Nationality")

data_fifa <- data_fifa %>% select(-all_of(cols))

data_fifa$Name <- gsub("[0-9]+", "", data_fifa$Name)

Now that we have cleaned the other dataset, we can proceed to merge it with our main one.

One problem here is that FIFA player names come in different formats (“Messi”, “L. Messi” or “Lionel Messi”). To solve this, we first standardize names to the format “L. Messi”, then extract the surname in both datasets and join on it.

Before all of this, we also replace characters from other languages with the closest equivalent in the English alphabet (“è” -> “e”).

# Replacing characters by its closest English alphabet character
final_players$Player <- stri_trans_general(final_players$Player, "latin-ascii")
data_fifa$Name <- stri_trans_general(data_fifa$Name, "latin-ascii")

# Handle additional characters manually for specific languages 
final_players$Player <- gsub("ć|č", "c", final_players$Player)
final_players$Player <- gsub("ł", "l", final_players$Player)
final_players$Player <- gsub("ń", "n", final_players$Player)
final_players$Player <- gsub("ă", "a", final_players$Player)
final_players$Player <- gsub("ș|ş", "s", final_players$Player)
final_players$Player <- gsub("ž|ź", "z", final_players$Player)

# Removes other symbols
final_players$Player <- gsub("[^[:alnum:] ]", "", final_players$Player)  

data_fifa$Name <- gsub("ć|č", "c", data_fifa$Name)
data_fifa$Name <- gsub("ł", "l", data_fifa$Name)
data_fifa$Name <- gsub("ń", "n", data_fifa$Name)
data_fifa$Name <- gsub("ă", "a", data_fifa$Name)
data_fifa$Name <- gsub("ș|ş", "s", data_fifa$Name)
data_fifa$Name <- gsub("ž|ź", "z", data_fifa$Name)
data_fifa$Name <- gsub("[^[:alnum:] ]", "", data_fifa$Name)



# Set a Name column to the format "L. Messi"
data_fifa$StandardizedName <- paste0(substr(data_fifa$Name, 1, 1), ". ", sapply(strsplit(data_fifa$Name, " "), function(x) x[length(x)]))

# Create a new column of the surname
final_players$Surname <- sapply(strsplit(final_players$Player, " "), function(x) x[length(x)])
data_fifa$Surname <- sapply(strsplit(data_fifa$StandardizedName, " "), function(x) x[length(x)])

# Merge both datasets by Surname
full_data <- inner_join(final_players, data_fifa, by = "Surname")
## Warning in inner_join(final_players, data_fifa, by = "Surname"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 7480 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Eliminating repeated rows where the merge was not correct
full_data <- na.omit(full_data)
full_data <- full_data %>%
  group_by(Player) %>%          
  # Keep one row per player, breaking ties by GoalsAllowed
  slice_max(order_by = GoalsAllowed, n = 1, with_ties = FALSE) %>%    
  ungroup()

names = full_data[,1]

The chunk returned an unexpected many-to-many relationship warning, most likely caused by players who share a surname. We tried several alternatives, but they either eliminated many valid players or introduced new problems; the method above kept the best players and caused no errors in the following steps.
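The ambiguity can be checked directly: dplyr's many-to-many warning fires when a join key is duplicated on both sides. A minimal, self-contained sketch with made-up surnames (toy stand-ins for `final_players$Surname` and `data_fifa$Surname`, not the project data):

```r
library(dplyr)

# Toy stand-ins for the two surname columns
left  <- tibble(Surname = c("Messi", "Hernandez", "Hernandez", "Pogba"))
right <- tibble(Surname = c("Messi", "Hernandez", "Hernandez", "Silva"))

# A surname duplicated on BOTH sides is what triggers the warning
dup_left  <- count(left,  Surname) %>% filter(n > 1)
dup_right <- count(right, Surname) %>% filter(n > 1)
intersect(dup_left$Surname, dup_right$Surname)
# -> "Hernandez"
```

Rows with such surnames could then be inspected, or disambiguated by club or league, before the join.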

This merge returns our final dataset, with no NA values and no repeated players: 1934 players and 65 variables.

We observe that some observations have been lost. Looking closely at the data, most of the unmatched observations simply do not appear in the FIFA dataset; since they are not very relevant, we can exclude them confidently.

# Observations not merged
not_merged <- anti_join(final_players, full_data, by = "Player")

2.3 Outliers

Outliers are actually important here: a player with a fantastic season, such as Messi, will show up as an outlier, so we want to keep those cases. However, many players are not in their manager's rotation and play very few minutes; they provide no useful information, so we remove them.

ggplot(full_data) +
  aes(x = X90s)+ 
  geom_density(fill="lightblue", color="blue") + 
   theme_minimal()

X90s is the total minutes played divided by 90, which roughly corresponds to the number of games played. Most leagues play 38 games of 90 minutes, so with added time a maximum value of 40 makes sense. More importantly, many players have not even played a single full game and will be irrelevant to our analysis, so we remove them.

# Eliminating players that have not played much
full_data <- full_data[full_data$X90s > 2,]

2.4 Exploratory Data Analysis

We will now create some graphs to generate some insights and obtain information about the players.

We now focus on goals and assists, two of the most important variables, since they are what win games.

In this plot we see, as expected, that forwards have the highest values; remarkably, a few goalkeepers also have a goal or an assist. Players above the line have scored fewer goals than expected, while players further right and below the line have scored more than expected, so the bottom right of the plot indicates excellent conversion of chances into goals and assists.

We can also clearly see a linear relationship between the two variables.

ggplot(full_data)+
  aes(x = Assists+Goals, y = ExpectedGoals+ExpectedAssists, color = Pos)+
  geom_point(alpha=0.85)+
  geom_text_repel(
    data = subset(full_data, Assists + Goals > 25), 
    aes(label = Player),
    size = 2
  ) +
  geom_abline()+
  ggtitle(" Goals and Assists vs Expected Goals and Assists")+
  xlab("Total Goals + Assists") +                      
  ylab("Expected Goals + Assists") + 
  theme(plot.title = element_text(hjust = 0.5))

In this plot we focus on offensive players. Game-changing players are usually targeted by opponents, and fouling them is one of the most common ways to stop them, which is why we compare fouls drawn with dribble success rate.

The plot shows that good dribblers are usually the most fouled and also make many carries; the clearest example is Vinicius Jr.

ggplot(full_data)+
  aes(x = PercSuccessfulDribbles, y = FoulsDrawn, color = Pos, size=Carries)+
  geom_point(alpha = 0.85)+
  scale_size_continuous(range = c(1, 8)) +
  geom_text_repel(
    data = subset(full_data, FoulsDrawn > 70), 
    aes(label = Player),
    size = 2
  ) +
  ggtitle("Successful Dribble % vs Fouls Drawn")+
  xlab("Successful Dribble %") +                      
  ylab("Fouls Drawn") + 
  theme(plot.title = element_text(hjust = 0.5))
## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

In this plot we focus on passing. This density plot shows the region where most players fall; we also observe that players who cover more passing distance per game tend to have higher pass accuracy.

ggplot(full_data, aes(x=100*CompletedPasses/AttemptedPasses, y=PassingDistance/X90s)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", colour="white")+
  ggtitle("Pass Accuracy % vs Passing Distance per Game")+
  xlab("Pass Accuracy %") +                      
  ylab(" Avg. Passing Distance per Game") + 
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 133 rows containing non-finite values (`stat_density2d()`).

Next, we create a correlation plot for the defensive actions. Except for aerial duels, they are closely related. Surprisingly, these defensive actions are more strongly correlated for midfielders, and sometimes forwards, than for defenders.

# Select only the defensive columns for the correlogram
correlogram_data <- full_data[, c("SuccessfulTackles", "Blocks", "FoulsCommited", "RecoveredBalls", "PercAerialDuelsWon")]

# Plotting the correlogram
ggpairs(correlogram_data, 
        title = "Correlogram of Defensive Actions",
        lower = list(continuous = wrap("points", alpha = 0.3)),
        upper = list(continuous = wrap("cor", size = 4)), 
        ggplot2::aes(colour=full_data$Pos)) +
  theme(plot.title = element_text(hjust = 0.5)) 
## Warning in cor(x, y): the standard deviation is zero

Here we study goalkeepers. As save % increases, clean sheets also tend to increase, although the relationship is not very strong; there are also some clear outliers, such as Marc-Andre ter Stegen.

ggplot(full_data)+
  aes(x = 100*Saves/ShotsOnTargetAgainst, y = CleanSheets, color = Comp)+
  geom_point(alpha = 0.85)+
  geom_text_repel(
    data = subset(full_data, Saves/ShotsOnTargetAgainst > 0.6 & CleanSheets >= 10), 
    aes(label = Player),
    size = 2
  ) +
  ggtitle("Save % vs Clean Sheets")+
  xlab("Save %") +                      
  ylab("Clean Sheets") + 
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 1801 rows containing missing values (`geom_point()`).
## Warning: ggrepel: 24 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

2.5 Data transformation and scaling

Before applying any algorithm we have to decide whether to scale the data, and select the numeric columns we are interested in. We start by choosing some numeric columns and plotting their distributions.

numeric_columns <- full_data[, c("X90s", "TacklesDone", "SuccessfulTackles", "PercDribblersTackled", "Blocks", "TacklesANDIntercep", "Clearances", "ShotCreatingActions", "GoalCreatingActions","PointsPerMatch","GoalsByTeam", "GoalsAgainstTeam")] 

par(mfrow = c(3, 3)) # 3x3 grid to fit all histograms

for (col_name in names(numeric_columns)) {
  hist(
    numeric_columns[[col_name]], 
    main = paste("Histogram of", col_name), 
    xlab = col_name, 
    col = "skyblue", 
    border = "white"
  )
}

Most of these distributions are right-skewed, concentrated at low values with a long right tail. One reason is the goalkeepers, who have no values recorded for these outfield variables; exactly the same happens in reverse for goalkeeper statistics. In addition, most squads have around 25 players while a lineup has only 11, so minutes are spread thinly and the recorded statistic for most players is low. We will now use all numeric variables and see what relations and information we can obtain.

# Need to select numeric data from the dataset
numeric_data <- select_if(full_data, is.numeric)

# Using boxplots with the previous columns for a better visualization
boxplot(numeric_columns, las=2, col="darkblue")

These boxplots clearly show that we need to scale the data before running any algorithm, given the large differences in variance between variables.

As the data is mostly right-skewed, we apply a square root transformation to the numeric columns to reduce the skew before scaling.

# Scaling the data
scaled_data <- scale(sqrt(numeric_data)) 

boxplot(scaled_data, las=2, col="darkblue")

The variables are now scaled, so there are no longer large differences in variance. The very narrow boxplots in the middle correspond to goalkeeper statistics, for which almost all players have a 0; the many apparent outliers there are simply the goalkeepers themselves.
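Such near-constant columns can also be flagged programmatically. A toy sketch (hypothetical column names, not the project data) computing the share of the single most frequent value per column; caret's nearZeroVar applies a similar idea:

```r
set.seed(1)

# Toy frame: an outfield stat plus a goalkeeper-only stat that is 0 for most rows
toy <- data.frame(Tackles      = rpois(100, 3),
                  GoalsAllowed = c(rpois(5, 30), rep(0, 95)))

# Share of the single most common value in each column
dominant_share <- sapply(toy, function(col) max(table(col)) / length(col))

# Columns dominated by one value (here, the goalkeeper statistic's zeros)
names(dominant_share[dominant_share > 0.9])
# -> "GoalsAllowed"
```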

Let’s look at the correlations between the different variables, to see whether we can obtain some insights before running any algorithm. We use the corrplot library for a clearer picture.

correlation_matrix = cor(numeric_data)

corrplot(correlation_matrix, 
         method = "ellipse",       
         type = "lower",           
         order = "hclust",         
         tl.col = "black",         
         tl.srt = 45,              
         tl.cex = 0.5,             
         cl.cex = 0.5)

This shows the correlations between the variables. Several pairs are highly correlated, which is exactly the redundancy PCA can compress, while other variables show little or no correlation, so PCA may reveal additional structure among them.

3. PRINCIPAL COMPONENT ANALYSIS

3.1 Motivation

The main objective of PCA is to represent a high-dimensional space in fewer dimensions, where the new variables (principal components) are uncorrelated linear combinations of the original ones. This reduces redundancy and reveals relationships between variables.

The first principal component points in the direction of maximum variance. It is given by:

\[ Z_1 = X \mathbf{a}_1 \]

Here \(X\) represents the data matrix, \(Z_1\) the scores of the first component, and \(\mathbf{a}_1\) a vector of loadings that defines a direction in the original feature space.

As we want to retain as much information as possible, we maximize the variance of this component, which is given by:

\[ s_{z_1}^2 = \mathbf{a}_1^T S \mathbf{a}_1 \]

Here \(S\) represents the covariance matrix of the data, and \(\mathbf{a}_1\) is constrained to be a unit vector so the variance cannot be inflated simply by rescaling the loadings. Scaling the data is also important in PCA: without it, variables measured on larger scales have larger variances and dominate the analysis.

As \(S\) is a covariance matrix we can decompose it by:

\[ S \mathbf{v} = \lambda \mathbf{v} \]

Combining this with the previous formula and the fact that \(\mathbf{a}_1\) is a unit vector, the maximum variance of the first component equals the largest eigenvalue, and its direction is given by the corresponding eigenvector \(\mathbf{a}_1\). The same procedure yields the remaining components; since uncorrelated variables have covariance 0, each subsequent component is orthogonal to the previous ones. The matrix of principal component scores is given by:

\[ Z = X A_r \]

Where \(Z\) is the matrix of the principal components, \(X\) the data matrix and \(A_r\) the principal component loadings.

PCA will be essential in our project: the dataset has around 60 dimensions, and reducing that number will give a much more visual and clearer representation of the data.
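The derivation above can be verified numerically on a small toy matrix (a sketch, separate from the project pipeline): the variance of the first score vector equals the largest eigenvalue of \(S\), and prcomp reproduces \(Z = X A_r\) up to sign.

```r
set.seed(1)
X <- scale(matrix(rnorm(100 * 4), ncol = 4), scale = FALSE)  # centered toy data

S   <- cov(X)                   # covariance matrix
eig <- eigen(S)                 # S v = lambda v
Z1  <- X %*% eig$vectors[, 1]   # first principal component scores, Z_1 = X a_1

# Variance of Z_1 equals the largest eigenvalue
all.equal(as.numeric(var(Z1)), eig$values[1])

# prcomp gives the same scores up to a sign flip
pca_toy <- prcomp(X, center = FALSE)
max(abs(abs(pca_toy$x[, 1]) - abs(Z1)))  # effectively 0
```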

# scale. is set to FALSE as the data is already scaled; scaling twice could distort it
pca = prcomp(scaled_data, scale. = FALSE)

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     5.0007 3.0047 2.5488 1.69557 1.40090 1.37227 1.18106
## Proportion of Variance 0.4387 0.1584 0.1140 0.05044 0.03443 0.03304 0.02447
## Cumulative Proportion  0.4387 0.5971 0.7111 0.76152 0.79595 0.82899 0.85346
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     1.03765 0.98572 0.85973 0.70234 0.68103 0.65859 0.6318
## Proportion of Variance 0.01889 0.01705 0.01297 0.00865 0.00814 0.00761 0.0070
## Cumulative Proportion  0.87235 0.88939 0.90236 0.91102 0.91915 0.92676 0.9338
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.58115 0.55705 0.54325 0.48847 0.45800 0.45072 0.43914
## Proportion of Variance 0.00593 0.00544 0.00518 0.00419 0.00368 0.00356 0.00338
## Cumulative Proportion  0.93969 0.94513 0.95031 0.95450 0.95818 0.96174 0.96512
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.42361 0.40737 0.38985 0.37542 0.36917 0.36127 0.34179
## Proportion of Variance 0.00315 0.00291 0.00267 0.00247 0.00239 0.00229 0.00205
## Cumulative Proportion  0.96827 0.97118 0.97385 0.97632 0.97871 0.98100 0.98305
##                           PC29   PC30    PC31    PC32    PC33   PC34    PC35
## Standard deviation     0.33488 0.3200 0.31569 0.28664 0.28353 0.2610 0.23393
## Proportion of Variance 0.00197 0.0018 0.00175 0.00144 0.00141 0.0012 0.00096
## Cumulative Proportion  0.98502 0.9868 0.98857 0.99001 0.99142 0.9926 0.99357
##                           PC36    PC37   PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.22057 0.20849 0.1992 0.18964 0.18005 0.15689 0.14800
## Proportion of Variance 0.00085 0.00076 0.0007 0.00063 0.00057 0.00043 0.00038
## Cumulative Proportion  0.99443 0.99519 0.9959 0.99652 0.99708 0.99752 0.99790
##                           PC43    PC44    PC45   PC46    PC47    PC48    PC49
## Standard deviation     0.14093 0.13548 0.11987 0.1063 0.10128 0.09973 0.09310
## Proportion of Variance 0.00035 0.00032 0.00025 0.0002 0.00018 0.00017 0.00015
## Cumulative Proportion  0.99825 0.99857 0.99882 0.9990 0.99920 0.99938 0.99953
##                           PC50    PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     0.08590 0.07691 0.07118 0.06539 0.04943 0.03780 0.01830
## Proportion of Variance 0.00013 0.00010 0.00009 0.00008 0.00004 0.00003 0.00001
## Cumulative Proportion  0.99966 0.99976 0.99985 0.99993 0.99997 0.99999 1.00000
##                            PC57
## Standard deviation     0.006674
## Proportion of Variance 0.000000
## Cumulative Proportion  1.000000

3.2 Selecting the number of components

To select the number of components we build on the Kaiser criterion, which states that only principal components with eigenvalues greater than 1 should be retained: since each eigenvalue is the variance that component explains, in units of one standardized variable, an eigenvalue below 1 means the component explains less than a single original variable and is not useful.

However, this criterion does not deal well with noise, so we also use parallel analysis, which compares the eigenvalues of the actual data with those obtained from random data of the same dimensions as the original.
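The idea behind both criteria can be sketched by hand on toy data (an illustration only; the actual analysis below uses psych::fa.parallel). Two latent factors drive six columns, two columns are pure noise, and we compare the observed eigenvalues against both 1 and the mean eigenvalues of random data of the same shape:

```r
set.seed(42)
n <- 300; p <- 8

# Two latent factors each drive three columns; the last two columns are noise
f1 <- rnorm(n); f2 <- rnorm(n)
X <- cbind(sapply(1:3, function(i) f1 + rnorm(n, sd = 0.5)),
           sapply(1:3, function(i) f2 + rnorm(n, sd = 0.5)),
           matrix(rnorm(n * 2), ncol = 2))

obs_eig <- eigen(cor(X))$values

# Mean eigenvalues of random data with the same dimensions (the benchmark)
rand_eig <- rowMeans(replicate(100,
  eigen(cor(matrix(rnorm(n * p), ncol = p)))$values))

sum(obs_eig > 1)         # Kaiser criterion
sum(obs_eig > rand_eig)  # simplified parallel-analysis count
```

With this construction the two planted factors stand clearly above both thresholds; parallel analysis is the stricter test, since its benchmark for the leading eigenvalues sits above 1.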

# Using factoextra package to see the variances explained by each component
fviz_eig(pca, addlabels = TRUE)

# Parallel analysis plot
fa.parallel(scaled_data, fa = "pc", show.legend = TRUE, main = "Parallel Analysis for PCA", sqrt=F)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  7
# get_eigenvalue returns one eigenvalue per column (57 in total): the variance
# explained by each component, with the percentage and cumulative totals.
get_eigenvalue(pca)
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.500693e+01     4.387181e+01                    43.87181
## Dim.2  9.028209e+00     1.583896e+01                    59.71077
## Dim.3  6.496474e+00     1.139732e+01                    71.10810
## Dim.4  2.874941e+00     5.043757e+00                    76.15185
## Dim.5  1.962532e+00     3.443038e+00                    79.59489
## Dim.6  1.883136e+00     3.303748e+00                    82.89864
## Dim.7  1.394900e+00     2.447193e+00                    85.34583
## Dim.8  1.076721e+00     1.888984e+00                    87.23482
## Dim.9  9.716420e-01     1.704635e+00                    88.93945
## Dim.10 7.391278e-01     1.296715e+00                    90.23617
## Dim.11 4.932761e-01     8.653966e-01                    91.10156
## Dim.12 4.638078e-01     8.136978e-01                    91.91526
## Dim.13 4.337446e-01     7.609554e-01                    92.67622
## Dim.14 3.991132e-01     7.001986e-01                    93.37642
## Dim.15 3.377297e-01     5.925082e-01                    93.96892
## Dim.16 3.103037e-01     5.443925e-01                    94.51332
## Dim.17 2.951213e-01     5.177566e-01                    95.03107
## Dim.18 2.386011e-01     4.185985e-01                    95.44967
## Dim.19 2.097674e-01     3.680129e-01                    95.81768
## Dim.20 2.031521e-01     3.564071e-01                    96.17409
## Dim.21 1.928440e-01     3.383228e-01                    96.51241
## Dim.22 1.794461e-01     3.148178e-01                    96.82723
## Dim.23 1.659480e-01     2.911368e-01                    97.11837
## Dim.24 1.519826e-01     2.666362e-01                    97.38500
## Dim.25 1.409393e-01     2.472619e-01                    97.63227
## Dim.26 1.362837e-01     2.390943e-01                    97.87136
## Dim.27 1.305144e-01     2.289727e-01                    98.10033
## Dim.28 1.168221e-01     2.049510e-01                    98.30528
## Dim.29 1.121416e-01     1.967397e-01                    98.50202
## Dim.30 1.024232e-01     1.796897e-01                    98.68171
## Dim.31 9.965747e-02     1.748377e-01                    98.85655
## Dim.32 8.216268e-02     1.441451e-01                    99.00070
## Dim.33 8.038759e-02     1.410309e-01                    99.14173
## Dim.34 6.814144e-02     1.195464e-01                    99.26127
## Dim.35 5.472286e-02     9.600502e-02                    99.35728
## Dim.36 4.864974e-02     8.535041e-02                    99.44263
## Dim.37 4.346663e-02     7.625725e-02                    99.51889
## Dim.38 3.969193e-02     6.963497e-02                    99.58852
## Dim.39 3.596366e-02     6.309414e-02                    99.65162
## Dim.40 3.241779e-02     5.687331e-02                    99.70849
## Dim.41 2.461459e-02     4.318348e-02                    99.75167
## Dim.42 2.190291e-02     3.842616e-02                    99.79010
## Dim.43 1.986060e-02     3.484316e-02                    99.82494
## Dim.44 1.835582e-02     3.220320e-02                    99.85715
## Dim.45 1.436804e-02     2.520709e-02                    99.88235
## Dim.46 1.129880e-02     1.982245e-02                    99.90217
## Dim.47 1.025808e-02     1.799662e-02                    99.92017
## Dim.48 9.946489e-03     1.744998e-02                    99.93762
## Dim.49 8.667846e-03     1.520675e-02                    99.95283
## Dim.50 7.378167e-03     1.294415e-02                    99.96577
## Dim.51 5.915887e-03     1.037875e-02                    99.97615
## Dim.52 5.067300e-03     8.889999e-03                    99.98504
## Dim.53 4.275215e-03     7.500376e-03                    99.99254
## Dim.54 2.443038e-03     4.286032e-03                    99.99683
## Dim.55 1.428939e-03     2.506910e-03                    99.99933
## Dim.56 3.348957e-04     5.875363e-04                    99.99992
## Dim.57 4.454102e-05     7.814214e-05                   100.00000

2 dimensions explain 59.7% of the variability while 3 dimensions explain 71.11%.

This next graph uses cos^2, which measures the quality of representation of each variable by each principal component: how much of the variable's variance is captured by that component. Values lie between 0 and 1, so they are easy to interpret; a high cos^2 means the variable is well represented by that component.

var <- get_pca_var(pca)
corrplot(
  var$cos2, 
  is.corr = FALSE, 
  method = "color",               
  type = "lower",                 
  tl.cex = 0.5,                   
  tl.srt = 45,                    
  cl.cex = 0.5,
  tl.col = "black",
  title = "Representation of Variables by Each Component",
  mar = c(0, 0, 2, 0))
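For standardized data, a variable's cos^2 on a component is just its squared correlation with that component's scores, so the matrix plotted above can be reproduced without factoextra. A self-contained sketch on toy data:

```r
set.seed(7)
X   <- scale(matrix(rnorm(150 * 5), ncol = 5))  # standardized toy data
pca <- prcomp(X)

# Variable coordinates: loading times component standard deviation
coord <- pca$rotation %*% diag(pca$sdev)
cos2  <- coord^2

# Identical to the squared correlations between variables and scores
max(abs(cos2 - cor(X, pca$x)^2))  # effectively 0

# Each variable's cos2 values sum to 1 across all components
rowSums(cos2)
```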

We have decided to maintain 7 components.

This is based on our previous graphs. In the scree plot, beyond the eigenvalues themselves, the variance explained by 7 components is 85.34%, a very good value. Parallel analysis also returned 7 components, backing up this choice. Moreover, the last graph shows essentially no representation of the variables from Dim 15 to Dim 57: most variables are well represented in the first few dimensions, and every variable is represented in some way by the 7th. Since we started with 57 dimensions and want to reduce dimensionality as much as possible while retaining as much variance as possible, 7 components is a good choice.
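Keeping 7 of the 57 components means replacing each player's 57 standardized statistics with 7 scores (the matrix pca$x[, 1:7]); the discarded information is exactly the variance of the dropped components. A toy sketch of truncation and reconstruction (not the project data; k is a hypothetical cutoff):

```r
set.seed(3)
X   <- scale(matrix(rnorm(100 * 10), ncol = 10))
pca <- prcomp(X)
k   <- 4   # hypothetical number of retained components

Zk <- pca$x[, 1:k]                   # reduced representation, n x k
Xk <- Zk %*% t(pca$rotation[, 1:k])  # rank-k reconstruction of X

# Residual variance equals the summed variance of the dropped components
sum((X - Xk)^2) / (nrow(X) - 1)
sum(pca$sdev[(k + 1):ncol(X)]^2)
```

The two printed values agree, which is why the cumulative-variance table is a direct measure of how faithful the reduced representation is.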

3.4 Interpretation and Visualization of components

# First Component
par(mfrow = c(1,2))
barplot(pca$rotation[, 1], 
        horiz = TRUE,        
        col = "darkblue",
        main = "First Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 1)

# Second Component
par(mfrow = c(1,2))

barplot(pca$rotation[, 2], 
        horiz = TRUE,        
        col = "darkblue",
        main = "Second Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 2)

# Third Component
par(mfrow = c(1,2))

barplot(pca$rotation[, 3], 
        horiz = TRUE,        
        col = "darkblue",
        main = "Third Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 3)

# Fourth Component
par(mfrow = c(1,2))

barplot(pca$rotation[, 4], 
        horiz = TRUE,        
        col = "darkblue",
        main = "Fourth Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 4)

# Fifth Component
par(mfrow = c(1,2))

barplot(pca$rotation[, 5], 
        horiz = TRUE,        
        col = "darkblue",
        main = "Fifth Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 5)

# Sixth Component
par(mfrow = c(1,2))

barplot(pca$rotation[, 6], 
        horiz = TRUE,        
        col = "darkblue",
        main = "Sixth Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 6)

# Seventh Component
par(mfrow = c(1,2))

barplot(pca$rotation[, 7], 
        horiz = TRUE,        
        col = "darkblue",
        main = "Seventh Principal Component",
        xlab = "Loading",
        las = 1,             
        cex.names = 0.5)
fviz_contrib(pca, choice = "var", axes = 7)

This shows the most relevant variables.

For the first component, the most relevant variables are touches and carries, together with passes completed and attempted and shot-creating actions. These variables clearly highlight midfielders, who make a lot of runs, usually touch the ball the most and are responsible for generating the attack in most teams.

The second component gives a lot of weight to goalkeeper statistics, focusing on Saves, Shots Against and Clean Sheets, for which outfield players have no values. It also weights X90s and games played, which is again related to goalkeepers, as most teams have one goalkeeper who plays almost every match.

The third component clearly captures attacking positions, as Expected Goals and Shots on Target are the most important variables; goals and shots also contribute strongly, as do offsides, which are usually committed by attackers.

The fourth component clearly captures the most famous players, since Overall, Value and Wage are by far the most relevant variables here.

The fifth component corresponds to weight and height, so it focuses on the players' physical attributes.

The sixth component gives a lot of weight to age, separating the oldest from the youngest players.

Finally, the seventh component weights points per match and Overall heavily, so it captures the best players, those who contribute the most to their team, especially in terms of points.

Now we will use biplots to examine the relations between the components. As the number of variables is very high, we used different cos^2 thresholds to filter the variables shown and keep the graphs readable. Given the high number of components selected (7), and the fact that the later components explain little variance compared to the first 3, we created a few biplots pairing the first component with components 2 to 4, which we think is more than enough to see the relationships in the data and obtain good insight.

# Biplot for the 1st and 2nd principal components
fviz_pca_biplot(pca, axes = c(1, 2), 
                geom.ind = "point", 
                pointsize = 3,      
                label = "var",
                select.var = list(cos2 = 0.8),
                col.var = "blue",   
                col.ind = "gray",
                repel = T,
                )  

# Biplot for the 1st and 3rd principal components
fviz_pca_biplot(pca, axes = c(1, 3), 
                geom.ind = "point",
                pointsize = 2,
                label = "var",
                select.var = list(cos2 = 0.85),
                col.var = "purple",
                col.ind = "gray",
                repel = T)

# Biplot for the 1st and 4th principal components
fviz_pca_biplot(pca, axes = c(1, 4), 
                geom.ind = "point",
                pointsize = 2,
                select.var = list(cos2 = 0.7),
                label = "var",
                col.var = "black",
                col.ind = "gray",
                repel = T)

The first graph is a really good visual example: on one side we see a cluster of points that are more than likely the goalkeepers, with the rest of the players grouped together. Looking at the arrows, it is clear that they point in the direction of most explained variance; the orthogonal arrows are the goalkeeper variables, which drive Dim2.

The second graph is similar: a group of separate points corresponds to the goalkeepers, while the first principal component clearly explains more variance, as can easily be seen from the spread of points. This also reflects the fact that most teams field more midfielders than players in other positions, and midfielders are usually key to a team's success. The remaining points lie in the direction of the orthogonal arrows and are likely the forwards.

The third graph follows the same pattern, with some separate points that are more than likely the goalkeepers. Again, most of the variance is explained by the first component (the midfielders), so most points spread along it, while the fourth component captures wage, so the points lower down correspond to the higher-paid footballers.

3.5 Interpretation of PCA for individuals

We will now focus on individuals, leagues and nations to see which players most strongly represent the variance of the first components; basically, the players with the best statistics for the variables that are best explained by the first principal components.

names <- full_data[[1]]

print("First Component")
## [1] "First Component"
names[order(-pca$x[,1])][1:10]
##  [1] "Bruno Fernandes"       "Lionel Messi"          "Enzo Le Fee"          
##  [4] "Joshua Kimmich"        "Bukayo Saka"           "Kieran Trippier"      
##  [7] "Antoine Griezmann"     "Trent AlexanderArnold" "Benjamin Bourigeaud"  
## [10] "Jude Bellingham"
# The ordering here is by -pca$x[,1] because the goalkeeper statistics are inversely related to the first component; without the sign flip we would get goalkeepers instead

print("Second Component")
## [1] "Second Component"
names[order(pca$x[,2])][1:10]
##  [1] "Alisson"         "Alexander Nubel" "David de Gea"    "David Raya"     
##  [5] "Maxime Dupe"     "Aaron Ramsdale"  "Bernd Leno"      "Koen Casteels"  
##  [9] "Jordan Pickford" "Alban Lafont"
print("Third Component")
## [1] "Third Component"
names[order(pca$x[,3])][1:10]
##  [1] "Kylian Mbappe"      "Erling Haaland"     "Victor Osimhen"    
##  [4] "Folarin Balogun"    "Lionel Messi"       "Lois Openda"       
##  [7] "Callum Wilson"      "Robert Lewandowski" "Karim Benzema"     
## [10] "Mohamed Salah"

These individuals confirm that our interpretation of the PCA was correct: the players obtained for each component match the positions and characteristics we described above.

We also want to gain insight into the best football teams, the nations with the best players, and the leagues with the best players.

team <- full_data[[4]]
nation <- full_data[[2]]
comp <- full_data[[5]]

# Calculate the mean of PC1 (z1) by team
team_mean <- data.frame(z1 = pca$x[,1], team = team) %>%
  group_by(team) %>%
  summarise(mean_z1 = mean(z1)) %>%
  arrange(desc(mean_z1))

# Display the team_mean data frame
print(team_mean)
## # A tibble: 98 × 2
##    team          mean_z1
##    <chr>           <dbl>
##  1 Barcelona        2.35
##  2 Strasbourg       2.33
##  3 Arsenal          2.23
##  4 Paris S-G        2.04
##  5 Real Madrid      2.00
##  6 Lens             1.94
##  7 Lazio            1.66
##  8 RB Leipzig       1.47
##  9 Marseille        1.45
## 10 Newcastle Utd    1.21
## # ℹ 88 more rows
# Calculate the mean of PC1 (z1) by nation
nation_mean <- data.frame(z1 = pca$x[,1], nation = nation) %>%
  group_by(nation) %>%
  summarise(mean_z1 = mean(z1)) %>%
  arrange(desc(mean_z1))

# Display the nation_mean data frame
print(nation_mean)
## # A tibble: 96 × 2
##    nation mean_z1
##    <chr>    <dbl>
##  1 " RUS"    4.54
##  2 " SUR"    4.22
##  3 " BEN"    4.13
##  4 " TUN"    3.89
##  5 " ARM"    3.88
##  6 " ECU"    3.82
##  7 " MAD"    3.73
##  8 " EGY"    3.18
##  9 " JAM"    3.10
## 10 " JPN"    2.68
## # ℹ 86 more rows
# Calculate the mean of PC1 (z1) by competition
comp_mean <- data.frame(z1 = pca$x[,1], comp = comp) %>%
  group_by(comp) %>%
  summarise(mean_z1 = mean(z1)) %>%
  arrange(desc(mean_z1))

# Display the comp_mean data frame
print(comp_mean)
## # A tibble: 5 × 2
##   comp              mean_z1
##   <chr>               <dbl>
## 1 "Ligue 1"          0.236 
## 2 "La Liga"          0.0821
## 3 " Premier League"  0.0175
## 4 "Serie A"         -0.0878
## 5 "Bundesliga"      -0.268

We see that the league whose players' statistics best represented the variance was Ligue 1; nation-wise it was Russia, and club-wise it was FC Barcelona. This could also be influenced by the number of players in each category: nations such as Russia have very few players in these leagues, so a single strong player can dominate the mean.

4. FACTOR ANALYSIS

4.1 Motivation

Factor analysis can be very useful in this case. As we know, factor analysis searches for latent traits in the data, and it can do the same in this dataset. Hidden features could range from offensive skill and defensive capability to physical fitness. It could therefore help us interpret player performance in broader terms rather than through individual statistics alone.

Combining all of this, factor analysis could be especially valuable if we decide to do any type of scouting, since we could use extracted factors to identify playmakers, dribblers or versatile players.

Finally, it can also help us reduce the dimensionality of the data for training future algorithms, as well as minimize the redundancy of highly correlated variables.

4.2 Selecting the number of factors

Before performing factor analysis, one of the key decisions is the number of factors to extract from the data. For this we will use parallel analysis, a method in which random data with the same shape as our original data is generated, and the eigenvalues of the actual and the random data are compared.

As we know, in factor analysis we have the covariance matrix: \[ \Sigma = LL^T + \Psi \] In reality, as factors might be correlated, this matrix can be decomposed as: \[ \Sigma = L\,Cov(F)\,L^T + \Psi \] where \(L\) is the factor loading matrix, \(Cov(F)\) is the covariance matrix of the factors (with eigenvalues on the diagonal) and \(\Psi\) is the covariance matrix of the unobserved stochastic error.
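As a small illustration of this decomposition, here is a sketch with base R's `factanal()` (which fits the orthogonal case, so \(Cov(F) = I\)) on synthetic one-factor data: the fitted loadings and uniquenesses should approximately reproduce the correlation matrix.

```r
# Sketch: verify Sigma ~ L L^T + Psi on synthetic one-factor data
# (orthogonal model, so Cov(F) is the identity)
set.seed(123)
n <- 500
f <- rnorm(n)                          # one latent factor
true_load <- c(0.9, 0.8, 0.7, 0.6)
X <- sapply(true_load, function(l) l * f + rnorm(n, sd = sqrt(1 - l^2)))
fit <- factanal(X, factors = 1)
L <- fit$loadings[]                    # estimated loading matrix (4 x 1)
Sigma_hat <- L %*% t(L) + diag(fit$uniquenesses)
max(abs(Sigma_hat - cor(X)))           # small residual: model reproduces correlations
```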

To assess whether a factor is significant, we compare its eigenvalue (the variance explained by that factor) with the mean eigenvalue from the random data at the corresponding position. If \(\lambda < \mathbb{E}[\lambda_{\text{random}}]\), the factor does not explain more variability than would be expected of random noise, so we do not use it.
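This comparison can be sketched by hand before calling `fa.parallel()` (synthetic data; the real call follows below): generate noise datasets of the same size, average their eigenvalues, and keep the dimensions whose observed eigenvalue beats the noise average.

```r
# Toy parallel analysis: observed eigenvalues vs. mean eigenvalues of
# same-shaped pure-noise data
set.seed(123)
n <- 300; p <- 6
f <- rnorm(n)
X <- cbind(sapply(1:3, function(i) f + rnorm(n)),  # 3 columns sharing one factor
           matrix(rnorm(n * 3), n, 3))             # 3 pure-noise columns
obs <- eigen(cor(X))$values
noise <- rowMeans(replicate(50, eigen(cor(matrix(rnorm(n * p), n, p)))$values))
sum(obs > noise)                                   # number of retained dimensions
```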

# Parallel analysis plot
fa.parallel(scaled_data, fa = "fa", show.legend = TRUE, main = "Parallel Analysis for Factor Analysis", sqrt=T)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Parallel analysis suggests that the number of factors =  8  and the number of components =  NA

As we observe, the real eigenvalues stay above those of the random data for factors 1 to 8, so we will use 8 factors for the factor analysis.

Before doing the factor analysis, two choices have to be made. For the rotation we chose “oblimin”, since we know from before that many variables are strongly correlated, and this rotation does not force the factors to be orthogonal. For the scores we chose “regression”, since it gives the most accurate estimates of the factors when they may be correlated.

4.3 Interpretation of factors

# Performing factor analysis and extracting 8 factors
players.f <- fa(scaled_data, 8, rotate="oblimin", scores="regression")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
print(players.f, cut=0, digits=2)
## Factor Analysis using method =  minres
## Call: fa(r = scaled_data, nfactors = 8, rotate = "oblimin", scores = "regression")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                            MR2   MR1   MR8   MR7   MR3   MR4   MR6   MR5    h2
## Age                      -0.04  0.00  0.04  0.00  0.00  0.00 -0.01  1.00 0.992
## Born                      0.04  0.00 -0.04  0.01  0.00  0.00  0.02 -1.00 0.981
## X90s                      0.48  0.41  0.09  0.51  0.22  0.05 -0.01  0.05 0.985
## TacklesDone              -0.27  0.26  0.29  0.52 -0.21  0.05 -0.08 -0.07 0.893
## SuccessfulTackles        -0.26  0.27  0.27  0.51 -0.21  0.05 -0.09 -0.07 0.868
## PercDribblersTackled     -0.64  0.27 -0.03  0.12 -0.15  0.04 -0.05 -0.08 0.635
## Blocks                   -0.31  0.34  0.12  0.49  0.02  0.03 -0.04 -0.03 0.865
## TacklesANDIntercep       -0.29  0.35  0.19  0.52 -0.21  0.05 -0.07 -0.05 0.928
## Clearances               -0.35  0.58 -0.38  0.43  0.03  0.00  0.07  0.05 0.825
## ShotCreatingActions      -0.09  0.05  0.77  0.08  0.13  0.04 -0.07  0.01 0.969
## GoalCreatingActions       0.03  0.10  0.72 -0.06  0.23  0.01 -0.01 -0.02 0.807
## YellowCards              -0.31  0.16  0.00  0.50  0.12  0.00 -0.03  0.02 0.559
## RedCards                 -0.09  0.08 -0.08  0.17  0.04 -0.03 -0.01  0.02 0.059
## FoulsCommited            -0.36  0.09  0.13  0.51  0.25  0.01 -0.03 -0.03 0.791
## FoulsDrawn               -0.20  0.00  0.32  0.35  0.28  0.04 -0.14 -0.05 0.736
## Offsides                 -0.11 -0.21  0.12  0.02  0.70  0.00 -0.03  0.07 0.607
## RecoveredBalls           -0.28  0.44  0.24  0.40 -0.07  0.03 -0.06 -0.02 0.956
## PercAerialDuelsWon       -0.73  0.31 -0.17  0.10  0.10  0.02  0.05  0.02 0.678
## CompletedPasses          -0.22  0.75  0.15  0.13 -0.07  0.05 -0.06  0.02 0.979
## AttemptedPasses          -0.24  0.68  0.19  0.18 -0.05  0.04 -0.06  0.03 0.985
## PassingDistance          -0.23  0.78  0.09  0.15 -0.09  0.04 -0.05  0.03 0.974
## ForwardPassingDistance   -0.22  0.77  0.06  0.19 -0.18  0.03 -0.03  0.06 0.954
## Assists                   0.09  0.04  0.81 -0.11  0.10  0.00  0.03 -0.02 0.684
## ExpectedAssistsGoal       0.00 -0.01  0.90 -0.04  0.10  0.03 -0.02  0.02 0.932
## ExpectedAssists          -0.01  0.10  0.86 -0.05  0.06  0.03 -0.05  0.03 0.911
## KeyPasses                -0.04 -0.02  0.90  0.04  0.04  0.04 -0.05  0.03 0.946
## Touches                  -0.27  0.62  0.18  0.23  0.02  0.04 -0.06  0.02 0.992
## PercSuccessfulDribbles   -0.62  0.24  0.00  0.04  0.05  0.02 -0.03 -0.06 0.560
## Carries                  -0.22  0.68  0.18  0.13  0.06  0.04 -0.09 -0.01 0.966
## DistanceRanBall          -0.18  0.63  0.23  0.11  0.07  0.03 -0.10 -0.06 0.914
## ForwardDistanceRanBall   -0.17  0.68  0.20  0.08  0.03  0.03 -0.10 -0.07 0.890
## Dispossessed             -0.13 -0.23  0.52  0.23  0.37  0.01 -0.08 -0.10 0.799
## Goals                    -0.05  0.00  0.07 -0.04  0.84  0.03 -0.01  0.01 0.794
## Shots                    -0.18 -0.02  0.23  0.09  0.72  0.04 -0.04 -0.03 0.932
## ShotsOnTarget            -0.10 -0.05  0.19  0.01  0.79  0.03 -0.04 -0.03 0.895
## ShotDistance             -0.68 -0.07  0.36  0.07 -0.14  0.02 -0.14 -0.06 0.784
## ExpectedGoals            -0.15 -0.01  0.04  0.00  0.92  0.05 -0.02  0.00 0.961
## NonPenaltyExpectedGoals  -0.16  0.01  0.04  0.00  0.92  0.04 -0.02 -0.02 0.960
## GoalsAllowed              0.91 -0.08 -0.06  0.04 -0.13  0.00  0.05  0.04 0.974
## ShotsOnTargetAgainst      0.92 -0.07 -0.06  0.03 -0.12  0.01  0.05  0.04 0.989
## Saves                     0.92 -0.06 -0.06  0.03 -0.12  0.01  0.05  0.04 0.987
## Won                       0.89  0.00 -0.08 -0.02 -0.10  0.04  0.05  0.03 0.922
## Drawn                     0.90 -0.06 -0.06  0.03 -0.12  0.01  0.05  0.03 0.943
## Lost                      0.89 -0.09 -0.06  0.06 -0.12  0.00  0.04  0.04 0.940
## CleanSheets               0.89 -0.01 -0.07 -0.01 -0.10  0.05  0.05  0.03 0.909
## MatchesPlayed             0.34  0.25  0.19  0.44  0.29  0.04 -0.06  0.02 0.790
## Starts                    0.46  0.39  0.10  0.50  0.20  0.05 -0.02  0.05 0.944
## PointsPerMatch            0.17  0.65  0.09 -0.58  0.11  0.12  0.01 -0.01 0.408
## GoalsByTeam               0.50  0.69  0.15  0.05  0.26  0.09  0.02  0.02 0.920
## GoalsAgainstTeam          0.37  0.05  0.06  0.80  0.15  0.00  0.01  0.05 0.873
## ExepectedGoalsByTeam      0.51  0.66  0.13  0.13  0.26  0.09 -0.01  0.02 0.961
## ExpectedGoalsAgainstTeam  0.41  0.17  0.05  0.73  0.18  0.00  0.00  0.05 0.922
## Overall                  -0.01 -0.12 -0.03  0.05 -0.05  0.90  0.08  0.04 0.796
## Value...                  0.05  0.00 -0.03 -0.07  0.02  0.94 -0.03 -0.08 0.869
## Wage...                  -0.02 -0.06 -0.04 -0.06 -0.01  0.95  0.01  0.06 0.889
## Height.cm..              -0.06  0.05  0.10  0.07  0.00  0.01  0.99 -0.05 0.877
## Weight.lbs..             -0.08  0.03  0.07  0.04  0.02  0.04  0.88  0.04 0.705
##                              u2 com
## Age                      0.0075 1.0
## Born                     0.0186 1.0
## X90s                     0.0149 3.4
## TacklesDone              0.1066 3.4
## SuccessfulTackles        0.1317 3.3
## PercDribblersTackled     0.3648 1.6
## Blocks                   0.1349 2.8
## TacklesANDIntercep       0.0724 3.3
## Clearances               0.1746 3.5
## ShotCreatingActions      0.0313 1.1
## GoalCreatingActions      0.1933 1.3
## YellowCards              0.4412 2.1
## RedCards                 0.9412 2.8
## FoulsCommited            0.2093 2.6
## FoulsDrawn               0.2644 4.0
## Offsides                 0.3933 1.3
## RecoveredBalls           0.0438 3.4
## PercAerialDuelsWon       0.3225 1.6
## CompletedPasses          0.0206 1.4
## AttemptedPasses          0.0147 1.6
## PassingDistance          0.0255 1.3
## ForwardPassingDistance   0.0459 1.4
## Assists                  0.3158 1.1
## ExpectedAssistsGoal      0.0684 1.0
## ExpectedAssists          0.0886 1.1
## KeyPasses                0.0542 1.0
## Touches                  0.0077 1.9
## PercSuccessfulDribbles   0.4398 1.3
## Carries                  0.0340 1.5
## DistanceRanBall          0.0858 1.7
## ForwardDistanceRanBall   0.1101 1.4
## Dispossessed             0.2011 3.0
## Goals                    0.2056 1.0
## Shots                    0.0682 1.4
## ShotsOnTarget            0.1053 1.2
## ShotDistance             0.2161 1.8
## ExpectedGoals            0.0394 1.1
## NonPenaltyExpectedGoals  0.0396 1.1
## GoalsAllowed             0.0265 1.1
## ShotsOnTargetAgainst     0.0114 1.1
## Saves                    0.0134 1.1
## Won                      0.0783 1.1
## Drawn                    0.0566 1.1
## Lost                     0.0603 1.1
## CleanSheets              0.0906 1.1
## MatchesPlayed            0.2101 4.0
## Starts                   0.0558 3.4
## PointsPerMatch           0.5924 2.3
## GoalsByTeam              0.0797 2.3
## GoalsAgainstTeam         0.1265 1.5
## ExepectedGoalsByTeam     0.0389 2.5
## ExpectedGoalsAgainstTeam 0.0778 1.9
## Overall                  0.2044 1.1
## Value...                 0.1308 1.0
## Wage...                  0.1107 1.0
## Height.cm..              0.1233 1.0
## Weight.lbs..             0.2951 1.0
## 
##                         MR2  MR1  MR8  MR7  MR3  MR4  MR6  MR5
## SS loadings           11.07 9.46 7.83 6.54 6.23 2.89 2.34 2.20
## Proportion Var         0.19 0.17 0.14 0.11 0.11 0.05 0.04 0.04
## Cumulative Var         0.19 0.36 0.50 0.61 0.72 0.77 0.81 0.85
## Proportion Explained   0.23 0.19 0.16 0.13 0.13 0.06 0.05 0.05
## Cumulative Proportion  0.23 0.42 0.58 0.72 0.85 0.91 0.95 1.00
## 
##  With factor correlations of 
##       MR2   MR1   MR8   MR7   MR3  MR4   MR6   MR5
## MR2  1.00 -0.20 -0.26 -0.09 -0.05 0.04  0.25  0.19
## MR1 -0.20  1.00  0.42  0.58  0.18 0.21 -0.14  0.02
## MR8 -0.26  0.42  1.00  0.34  0.62 0.25 -0.41 -0.07
## MR7 -0.09  0.58  0.34  1.00  0.19 0.07 -0.11  0.03
## MR3 -0.05  0.18  0.62  0.19  1.00 0.21 -0.16 -0.02
## MR4  0.04  0.21  0.25  0.07  0.21 1.00  0.09  0.13
## MR6  0.25 -0.14 -0.41 -0.11 -0.16 0.09  1.00  0.11
## MR5  0.19  0.02 -0.07  0.03 -0.02 0.13  0.11  1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 8 factors are sufficient.
## 
## df null model =  1596  with the objective function =  139.74 with Chi Square =  267344.5
## df of  the model are 1168  and the objective function was  29.25 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic n.obs is  1934 with the empirical chi square  1266.17  with prob <  0.023 
## The total n.obs was  1934  with Likelihood Chi Square =  55803.99  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.718
## RMSEA index =  0.156  and the 90 % confidence intervals are  0.154 0.157
## BIC =  46965.33
## Fit based upon off diagonal values = 1

As we can see from the output, the total variance explained by all factors is ~85%, but the variance explained by each individual factor decreases, flattening out at about 5% for the last three.

To better understand how each factor is influenced by the variables, we make the following plot.

# Extract loadings and convert them to a df
loadings <- as.data.frame(players.f$loadings[])
loadings$Variable <- rownames(loadings)

# Helper that builds the loading plot for one factor, with variables
# ordered by their loading on that factor. Note scale_fill_manual:
# the bars are coloured through the fill aesthetic.
plot_factor_loadings <- function(df, factor_col, ordinal) {
  df$Variable <- factor(df$Variable,
                        levels = df$Variable[order(df[[factor_col]], decreasing = TRUE)])
  ggplot(df, aes(x = Variable, y = .data[[factor_col]], fill = .data[[factor_col]] > 0)) +
    geom_col(color = "black") +
    labs(title = paste(ordinal, "Factor Loadings for Each Variable"),
         x = "Variables",
         y = paste(ordinal, "Loading")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8),
          strip.text = element_text(size = 12),
          plot.title = element_text(size = 16),
          axis.title = element_text(size = 12)) +
    scale_fill_manual(values = c("red", "blue"),
                      name = "Loading Sign",
                      labels = c("Negative", "Positive"))
}

# One plot per factor
factor_cols <- paste0("MR", 1:8)
ordinals <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th")
factor_plots <- lapply(seq_along(factor_cols), function(i)
  plot_factor_loadings(loadings, factor_cols[i], ordinals[i]))

# Combining plots
(factor_plots[[1]] + factor_plots[[2]]) / (factor_plots[[3]] + factor_plots[[4]]) /
  (factor_plots[[5]] + factor_plots[[6]]) / (factor_plots[[7]] + factor_plots[[8]])

For the first factor we observe that variables related to passing and game creation are the most influential, so the latent trait could be midfielders or game-creators. The second factor is influenced very positively by goalkeeper variables, while defender variables such as tackles load very negatively, so the latent trait could be goalkeepers (positive) versus defenders (negative). For the last factor we see that variables for offensive game-creators, those who create possible goal actions, load really high, so the latent trait should be players with this profile.

For the rest, we observe that the data we added from the FIFA dataset seems to have factors of its own, like factor 4, which explains high-value players. Interestingly, the 5th factor only captures age, separating older from younger players.

4.4 Interpretation of FA for players

Next, we will see which players score highest and lowest on each factor, to gain more insight about them and confirm what we discussed just before.

# Extracting factor scores
factor_scores <- players.f$scores
# The current order of factors
current_order <- c(2, 1, 8, 7, 3, 4, 6, 5)
num_factors <- ncol(factor_scores)  

# List of largest and smallest observations
largest_observations <- list()
smallest_observations <- list()

# Get idx of the largest and smallest scoring observations for each factor
for (i in 1:num_factors) {
  largest_idx <- which.max(factor_scores[, i])
  smallest_idx <- which.min(factor_scores[, i])
  # Store the idx in the lists
  largest_observations[[i]] <- largest_idx
  smallest_observations[[i]] <- smallest_idx
}

# Printing the results
for (i in 1:num_factors) {
  largest_idx <- largest_observations[[i]]
  smallest_idx <- smallest_observations[[i]]
  largest_name <- full_data$Player[largest_idx]
  smallest_name <- full_data$Player[smallest_idx]
  
  cat("Factor", current_order[i], "\n")
  cat("Highest-scoring player:", largest_name, "with score:", factor_scores[largest_idx, i], "\n")
  cat("Lowest-scoring player:", smallest_name, "with score:", factor_scores[smallest_idx, i], "\n\n")
}
## Factor 2 
## Highest-scoring player: David Raya with score: 5.520923 
## Lowest-scoring player: Gerard Pique with score: -0.8826895 
## 
## Factor 1 
## Highest-scoring player: Lewis Dunk with score: 3.465045 
## Lowest-scoring player: Wayne Hennessey with score: -2.040182 
## 
## Factor 8 
## Highest-scoring player: Bruno Fernandes with score: 3.596838 
## Lowest-scoring player: Diego Gonzalez with score: -1.942711 
## 
## Factor 7 
## Highest-scoring player: Joao Palhinha with score: 2.938966 
## Lowest-scoring player: Manu Vallejo with score: -2.376623 
## 
## Factor 3 
## Highest-scoring player: Erling Haaland with score: 4.662395 
## Lowest-scoring player: Vlad Chiriches with score: -1.676948 
## 
## Factor 4 
## Highest-scoring player: Kylian Mbappe with score: 4.100426 
## Lowest-scoring player: Sam Johnstone with score: -1.658215 
## 
## Factor 6 
## Highest-scoring player: Fraser Forster with score: 2.988086 
## Lowest-scoring player: Duvan Zapata with score: -3.119739 
## 
## Factor 5 
## Highest-scoring player: Joaquin with score: 3.1982 
## Lowest-scoring player: Leny Yoro with score: -2.406709

As we can see from the output, the hidden factors are clear. For example, for factor 5 (the “age” factor), the highest-scoring player is Joaquin, who recently retired, while the lowest-scoring one is a player who is now 19. For factor 8 we get Bruno Fernandes, who is an offensive playmaker at Manchester United, just as we explained before. Furthermore, for factor 3 we observe that the highest-scoring player is Haaland, one of the best forwards in the world, and looking closely at the variables that load on that factor, they are all goal-related.

Finally, we observe that although this analysis does pretty well, it is not perfect: for example, in factor 1 (the midfielder-focused factor) the highest-scoring player is Lewis Dunk, an average Premier League player, while Rodri, the best player in this role, places second just behind him.

In conclusion, we have performed a reasonable factor analysis that has given us meaningful hidden factors, which could be used for anything from scouting to improving a player's performance, among many other possibilities.

5. CLUSTERING

5.1 Motivation

As we have discussed before, one of the main goals of the project is to classify players into different groups. One very easy example is well-performing versus poorly performing players, but we could gain even more insights. Because of this, clustering is a key part of this player analysis.

5.2 K-means

5.2.1 Initial Clustering

Before we move on to fancier methods, such as kernel-based k-means and a more thorough selection of the number of clusters, we will run a basic initial k-means. The number of clusters will be 4, since at first glance the player’s position could be a very good way to initially describe them.

# Initial number of clusters
k <- 4

# Training an initial k-means
initial_kmeans <- eclust(scaled_data, "kmeans", stand=TRUE, k=k)

# Silhouette plot
fviz_silhouette(initial_kmeans)
##   cluster size ave.sil.width
## 1       1  133          0.50
## 2       2  457          0.15
## 3       3  696          0.29
## 4       4  648          0.18

# Perform PCA on scaled data
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
# Get clusters and position of each observation
pca_cluster_data <- as.data.frame(pca_result$x) %>%
  mutate(Cluster = initial_kmeans$cluster,
         Position = full_data$Pos)  

# Plot clustering with position differentiation
ggplot(pca_cluster_data, aes(x = PC1, y = PC2, color = as.factor(Cluster))) + 
  geom_point(aes(shape = Position, alpha=0.8))+
  scale_shape_manual(values = 1:k) +
  labs(title = "PCA of Clusters with Position Differentiation",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal()

As we observe from the silhouette plot, the clusters seem to have a reasonable silhouette score, so this number of clusters may not be too far from the optimal one.

When we plot the clustering against the positions (our initial hypothesis), we observe that the differentiation between them is not that clear. Goalkeepers have their own cluster, as expected, while we see one cluster for more offensive players like midfielders and forwards, and another for more defensive players, which could be defensive midfielders or defenders. The last cluster is hard to explain, since position does not seem to matter there; it could be grouping average players.

5.2.2 Selecting the number of clusters

Choosing the number of clusters is a difficult problem, but with a combination of different statistics we can have more confidence in our answer. The three metrics we will use are the Within-Cluster Sum of Squares (WSS), the Silhouette Score and the Gap Statistic.

The WSS computes the squared distance between each point in the cluster and the centroid of that cluster. Mathematically (since we will use the Euclidean distance): \[\text{WSS}(k) = \sum_{q=1}^{k} \sum_{x \in C_q} \| x - \mu_q \|^2\]

Where \(k\) is the number of clusters, \(C_q\) the data points in the cluster, and \(\mu_q\) the centroid of the cluster \(q\). With this we know that the WSS is a metric that measures how compact a cluster is.

Note that we want the clusters to be as compact as possible, but adding more clusters always makes them more compact, so we try to balance minimizing the WSS against keeping the number of clusters low (the elbow method).
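As a quick toy sketch (synthetic data, not the project dataset), kmeans() reports this exact quantity as tot.withinss, and it can only shrink as \(k\) grows, which is what produces the elbow:

```r
# Toy data, for illustration only: check that kmeans()'s tot.withinss
# matches the WSS formula, and watch WSS shrink as k grows.
set.seed(1)
toy <- matrix(rnorm(60), ncol = 2)  # 30 points in 2 dimensions

fit <- kmeans(toy, centers = 2, nstart = 10)
# WSS by hand: squared distance of each point to its assigned centroid
manual_wss <- sum((toy - fit$centers[fit$cluster, ])^2)
all.equal(manual_wss, fit$tot.withinss)  # TRUE

# WSS for k = 1..6: decreasing, so we look for where the drop flattens out
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
```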

The Silhouette Score measures how similar each observation is to its own cluster compared to the others. It ranges from -1 to 1, and the higher it is, the more confident we are that the clustering is correct. Here is the mathematical definition: \[\text{Silhouette Score} = \frac{1}{n} \sum_{i=1}^{n} s(i)\]

Where: \[s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\] \[a(i) = \frac{1}{|C_i| - 1} \sum_{\substack{j \in C_i \\ j \neq i}} d(i, j)\] \[b(i) = \min_{C_j \neq C_i} \left\{ \frac{1}{|C_j|} \sum_{l \in C_j} d(i, l) \right\}\]

\(a(i)\) is the average intra-cluster distance, \(b(i)\) is the minimum average distance between the observation and the points of any other cluster (inter-cluster distance), \(s(i)\) is the silhouette score for a single point, and \(d(i, l)\) in our case is the Euclidean distance.
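As a minimal check on synthetic data (using the cluster package loaded above), the silhouette function implements exactly these formulas; here we recompute \(s(1)\) by hand and compare:

```r
library(cluster)  # for silhouette()

set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # two well-separated groups
             matrix(rnorm(20, mean = 4), ncol = 2))
cl  <- kmeans(toy, centers = 2, nstart = 10)$cluster
sil <- silhouette(cl, dist(toy))

# Recompute s(1) by hand from a(1) and b(1)
D     <- as.matrix(dist(toy))
own   <- setdiff(which(cl == cl[1]), 1)  # own cluster, excluding the point itself
other <- which(cl != cl[1])              # the (single) other cluster
a1 <- mean(D[1, own])
b1 <- mean(D[1, other])                  # with 2 clusters, the min is this mean
s1 <- (b1 - a1) / max(a1, b1)
all.equal(as.numeric(s1), as.numeric(sil[1, 3]))  # TRUE: matches silhouette()
```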

Finally, the Gap Statistic compares the within-cluster variation for different numbers of clusters (\(k\)) with the expected value of this variation under a reference distribution. In practice we try to maximize it while still keeping the number of clusters as low as possible. Mathematically: \[W_k = \sum_{q=1}^{k} \sum_{x \in C_q} \| x - \mu_q \|^2\]

Where \(W_k\) is the within-cluster dispersion of the original dataset, \(C_q\) is the cluster \(q\) , \(\mu_q\) the centroid of the cluster, and \(\| x - \mu_q \|^2\) is the Euclidean distance we are using. With this we can compute the Gap Statistic: \[\text{Gap}(k) = \mathbb{E}_{P}[\log(W_k^*)] - \log(W_k)\] \[\mathbb{E}_{P}[\log(W_k^*)] \approx \frac{1}{B} \sum_{b=1}^{B} \log(W_k^{*(b)})\]

Where \(W_k^{*(b)}\) is the within-cluster dispersion of reference dataset number \(b\), since there are \(B\) reference datasets \(\{X_b^* \}_{b=1}^{B}\).
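On synthetic data this can be sketched with the clusGap function from the cluster package, which computes the statistic, and maxSE, which applies a standard-error decision rule for picking \(k\):

```r
library(cluster)  # for clusGap() and maxSE()

set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),  # two clear groups
             matrix(rnorm(40, mean = 5), ncol = 2))

gap <- clusGap(toy, FUN = kmeans, nstart = 10, K.max = 5, B = 20)
gap$Tab[, "gap"]  # rises sharply once k reaches the true number of groups
# Standard-error rule for choosing k from the gap curve
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
```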

To choose the number of clusters even more confidently, we will calculate these metrics on our normal dataset, on a version without goalkeepers, and on PCA-reduced versions of both. We first build these datasets.

# Filtering out goalkeepers
full_data_no_gk <- full_data %>% filter(Pos != "GK")
numeric_data_no_gk <- select_if(full_data_no_gk, is.numeric)

# One-hot encoding the positions for more meaningful clustering
dummy <- dummyVars("~ Pos", data = full_data_no_gk)
one_hot_encoded_pos <- predict(dummy, newdata = full_data_no_gk)
# Combining numeric data and the one-hot encoded columns
numeric_data_no_gk <- cbind(numeric_data_no_gk, one_hot_encoded_pos)
# Eliminating goalkeeper columns from non-goalkeeper data
numeric_data_no_gk <- numeric_data_no_gk[, !colnames(numeric_data_no_gk) %in% c("GoalsAllowed", "ShotsOnTargetAgainst", "Saves", "Won", "Drawn", "Lost", "CleanSheets")] 
# Scaling the non-goalkeeper data
scaled_data_no_gk <- scale(sqrt(numeric_data_no_gk))

# One-hot encoding the position in the normal dataset
one_hot_encoded_pos <- predict(dummy, newdata = full_data)
# Combining numeric data and the one-hot encoded columns
numeric_data <- cbind(numeric_data, one_hot_encoded_pos)
scaled_data <- scale(sqrt(numeric_data))

# Getting PCA datasets for goalkeeper and non-goalkeeper data
pca_result <- prcomp(scaled_data)
pca_result_no_gk <- prcomp(scaled_data_no_gk)
# Extract the scores for the first 7 principal components as before
pca_data <- as.data.frame(pca_result$x[, 1:7])
pca_data_no_gk <- as.data.frame(pca_result_no_gk$x[, 1:7])

Here we calculate these metrics for the 4 datasets (this cell takes some time to run).

# Range of k to try
k_values <- 2:10

# Function to calculate and plot metrics for cluster comparison 
cluster_comparison <- function(data) {
  # Storing WSS, Silhouette Scores and Gap Statistic
  wss_values <- numeric(length(k_values))
  silhouette_values <- numeric(length(k_values))
  gap_stat_values <- numeric(length(k_values))

  # Calculating the Gap Statistic
  gap_stat <- clusGap(data, FUN = kmeans, K.max = max(k_values), B = 50)
  
  # Calculating each metric for each k
  for (i in 1:length(k_values)) {
    k <- k_values[i]
    # Fit the k-means model
    kmeans_result <- kmeans(data, centers = k, nstart = 25)
    
    # Storing WSS value
    wss_values[i] <- kmeans_result$tot.withinss
    # Computing the silhouette score
    cluster_assignments <- kmeans_result$cluster
    dist_matrix <- dist(data)  
    silhouette_score <- silhouette(cluster_assignments, dist_matrix)
    silhouette_values[i] <- mean(silhouette_score[, 3])  
    # Storing the Gap Statistic
    gap_stat_values[i] <- gap_stat$Tab[i, "gap"]
  }

  # Plot WSS 
  plot(k_values, wss_values, type = "b", pch = 19, frame = FALSE,
       xlab = "Number of Clusters K",
       ylab = "Within-Cluster Sum of Squares (WSS)",
       main = "Elbow Method for Optimal K")

  # Plot avg. Silhouette Score
  plot(k_values, silhouette_values, type = "b", pch = 19, frame = FALSE,
       xlab = "Number of Clusters K",
       ylab = "Average Silhouette Score",
       main = "Silhouette Method for Optimal K")

  # Plot Gap Statistic
  plot(k_values, gap_stat_values, type = "b", pch = 19, frame = FALSE,
       xlab = "Number of Clusters K",
       ylab = "Gap Statistic",
       main = "Gap Statistic for Optimal K")
}

# Computing metrics of the 4 datasets
cluster_comparison(scaled_data)
## Warning: did not converge in 10 iterations

cluster_comparison(scaled_data_no_gk)
## Warning: did not converge in 10 iterations

cluster_comparison(pca_data)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

cluster_comparison(pca_data_no_gk)

If we look at the WSS for all four datasets, there is no clear elbow, but the decrease in WSS usually stops being significant around 4 to 6 clusters, so those three values are candidates for the number of clusters.

Looking at the Silhouette Score, all four datasets peak at k=2, which seems to be because there is a clear separation between goalkeepers and the rest of the players, while there is no huge difference between field players.

Finally the Gap Statistic usually has an elbow around k=5, which again makes 5 or 6 candidates for the optimal number of clusters.

It seems that the metrics point in different directions, especially the Silhouette Score, but its mathematical definition above reveals its weakness: when the intra-cluster and inter-cluster distances are very similar, the Silhouette Score drops. While this may be correct for datasets with highly differentiated classes, it is not our case, since, for example, a player can play both attacking and defensive positions. This is why the Silhouette Score is not reliable here.

On the other hand, again looking at the mathematical expression of the Gap Statistic, we can see that it performs really well for data where the classes are broader, since we are comparing when clusters are relevant against when they may only explain noise. Therefore, in our case the Gap Statistic will be the most reliable metric. Taking this into account, we will choose 5 as the number of clusters for k-means.

5.2.3 Interpretation of clusters

For the dataset on which to train the k-means, we chose the scaled dataset and not the PCA version: although PCA may eliminate some noise from the original dataset, it also leaves almost as many variables as there are clusters, which could cause overfitting problems.

# Training final K-means
final_kmeans <- eclust(scaled_data, "kmeans", stand=TRUE, k=5)

# Silhouette plot
fviz_silhouette(final_kmeans)
##   cluster size ave.sil.width
## 1       1  133          0.52
## 2       2  379          0.15
## 3       3  548          0.23
## 4       4  569          0.18
## 5       5  305          0.12

# Perform PCA on scaled data
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)

# Plotting the number of observations in each cluster
groups <- final_kmeans$cluster
barplot(table(groups), col="blue")

centers <- final_kmeans$centers
# Variable importance in each cluster's center
barplot(centers[1,], las=2, col="darkblue")

barplot(centers[2,], las=2, col="darkblue")

barplot(centers[3,], las=2, col="darkblue")

barplot(centers[4,], las=2, col="darkblue")

barplot(centers[5,], las=2, col="darkblue")

# Extracting PCA scores
pca_cluster_data <- pca_data %>%
  mutate(Cluster = as.factor(final_kmeans$cluster),
         Position = full_data$Pos,
         GoalsAssists = full_data$Goals + full_data$Assists)  

# Cluster plot with position differentiation
ggplot(pca_cluster_data, aes(x = PC1, y = PC2, color = Cluster, size = GoalsAssists)) + 
  geom_point(aes(shape = Position, alpha=0.8))+
  labs(title = "PCA of Clusters with Position Differentiation",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal()

Firstly, we can see a huge division between goalkeepers and the rest of the data, as expected. We can also see that the silhouette score is not the best, but as we discussed before, this metric is not too helpful for our dataset. Moreover, we can see some imbalance between classes, which is expected since, for example, the number of goalkeepers is fixed.

We clearly see that Cluster 1 is for goalkeepers. Cluster 2 is mostly forward-related but with little emphasis on creating play (passes), so centre-forwards could qualify for it. Cluster 3 is for players with very small contributions, so average players or ones that have not played much. Cluster 4 focuses on passes and defensive statistics, so defensive midfielders and defenders. Finally, Cluster 5 focuses on offensive players, but this time more goal-related, so many forwards should be in this cluster.

# Creating a data frame with cluster assignments and positions
pca_cluster_data <- pca_data %>%
  mutate(Cluster = final_kmeans$cluster, Position = full_data$Pos)

# Calculating counts of each position in each cluster
position_counts <- pca_cluster_data %>%
  group_by(Cluster, Position) %>%
  summarise(count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.
# Calculating the total number of observations in each cluster
cluster_totals <- pca_cluster_data %>%
  group_by(Cluster) %>%
  summarise(total = n())

# Merging the position counts with cluster totals to calculate percentages
position_percentages <- position_counts %>%
  left_join(cluster_totals, by = "Cluster") %>%
  mutate(percentage = (count / total) * 100)

# Plotting the percentage of each position in each cluster
ggplot(position_percentages, aes(x = as.factor(Cluster), y = percentage, fill = Position)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  labs(title = "Position Percentages in Each Cluster",
       x = "Cluster",
       y = "Percentage") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + 
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Position"))

This plot confirms our hypothesis: Cluster 1 is for goalkeepers; Cluster 2 for traditional forwards; Cluster 3 for average or under-performing players, which explains the mix of positions in that cluster; Cluster 4 for defenders and defensive midfielders; and Cluster 5 for offensive play-makers: offensive midfielders, forwards and even some defenders!

5.3 PAM

In K-medoids the center of a cluster is not an abstract point but an actual observation in the data; this is the main difference between it and k-means. This makes K-medoids a very stable algorithm in some cases, which could include ours, since having a reference player for each type of player seems like a good idea.
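A minimal toy illustration of this difference (with the pam function from the cluster package): the reported medoids are literal rows of the input, unlike k-means centroids:

```r
library(cluster)  # for pam()

set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)  # 20 points in 2D

fit <- pam(toy, k = 2)
fit$id.med                             # row indices of the two medoids
all(fit$medoids == toy[fit$id.med, ])  # TRUE: medoids are actual observations
```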

5.3.1 Selecting the number of clusters

We will assume that the optimal number of clusters is the same as the one we calculated before for k-means. But to check, we will calculate the Silhouette Score and the Gap Statistic. (This cell takes some time to run.)

# Calculating Silhouette score and Gap statistic
fviz_nbclust(scaled_data, pam, method = 'silhouette', k.max = 10)

fviz_nbclust(scaled_data, pam, method = 'gap_stat', k.max = 10, nboot = 10)

We observe that, following the elbow method, our guess of 5 could again be a good value for k, so we will choose it.

5.3.2 Interpretation of clusters

# Training final PAM clustering
final_pam <- eclust(scaled_data, "pam", stand = TRUE, k = 5, graph = FALSE)

# Extracting cluster assignments
cluster_assignments <- factor(final_pam$cluster)

# Plotting the clustering in 2D
fviz_cluster(final_pam, data = scaled_data, geom = c("point"), pointsize=1) +
  theme_minimal() +
  geom_text(aes(label = names, color = factor(final_pam$cluster), alpha = 0.8),
            hjust = 0, vjust = 0, size = 2, check_overlap = TRUE) +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired")

# Adding cluster and player names
pca_data_3d <- as.data.frame(pca_data) %>%
  mutate(Cluster = cluster_assignments, Names = full_data$Player, Position=full_data$Pos)

# Making a 3D plot of the clustering
plot_ly(
  data = pca_data_3d,
  x = ~PC1,
  y = ~PC2,
  z = ~PC3,
  color = ~Cluster, 
  text = ~paste("Name:", Names, "<br>Position:", Position),
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 4, opacity = 0.8),
  hoverinfo = "text"
)

From the plots we can see that the clustering has remained almost the same as with k-means. But checking by eye is not enough, so we will make some plots to obtain a more meaningful answer.

# Calculating Silhouette score of the final PAM
fviz_silhouette(final_pam)
##   cluster size ave.sil.width
## 1       1  606          0.16
## 2       2  133          0.51
## 3       3  439          0.09
## 4       4  323          0.14
## 5       5  433          0.17

# Performing PCA on scaled data
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)

# Plotting the number of observations in each cluster
groups <- final_pam$cluster
barplot(table(groups), col="blue")

# Variable importance in each cluster's center observation
centers <- final_pam$medoids
barplot(centers[1,], las=2, col="darkblue")

barplot(centers[2,], las=2, col="darkblue")

barplot(centers[3,], las=2, col="darkblue")

barplot(centers[4,], las=2, col="darkblue")

barplot(centers[5,], las=2, col="darkblue")

# Creating a data frame with cluster assignments and positions
pca_cluster_data <- pca_data %>%
  mutate(Cluster = final_pam$cluster, Position = full_data$Pos)

# Calculating counts of each Position in each Cluster
position_counts <- pca_cluster_data %>%
  group_by(Cluster, Position) %>%
  summarise(count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Cluster'. You can override using the
## `.groups` argument.
# Calculating the total number of observations in each Cluster
cluster_totals <- pca_cluster_data %>%
  group_by(Cluster) %>%
  summarise(total = n())

# Merging the position counts with cluster totals to calculate percentages
position_percentages <- position_counts %>%
  left_join(cluster_totals, by = "Cluster") %>%
  mutate(percentage = (count / total) * 100)

# Plotting the percentage of each position in each cluster
ggplot(position_percentages, aes(x = as.factor(Cluster), y = percentage, fill = Position)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  labs(title = "Position Percentages in Each Cluster",
       x = "Cluster",
       y = "Percentage") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = "Position"))

# Calculating similarity between K-means and K-medoids
adjustedRandIndex(final_kmeans$cluster, final_pam$clustering)
## [1] 0.5067637

If we look more closely at the centers and the clusters in general, we see that the clustering is very similar to that of k-means, with almost the same plot and differentiation between positions, only with the cluster labels reordered. This suggests that our model is robust and does not vary much between algorithms.
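To make this comparison concrete, here is a hand-rolled sketch of the adjusted Rand index from its contingency-table definition (the quantity that adjustedRandIndex computes, though this is an illustrative reimplementation, not the mclust code). Note that it is invariant to relabelling the clusters, which is exactly what makes it suitable for comparing clusterings whose labels differ:

```r
# Minimal adjusted Rand index from the contingency-table formula (a sketch,
# not the mclust implementation used above).
ari <- function(x, y) {
  tab <- table(x, y)
  a   <- sum(choose(tab, 2))           # pairs grouped together in both partitions
  b   <- sum(choose(rowSums(tab), 2))  # pairs grouped together in x
  c   <- sum(choose(colSums(tab), 2))  # pairs grouped together in y
  expected <- b * c / choose(sum(tab), 2)
  (a - expected) / ((b + c) / 2 - expected)
}

cl <- c(1, 1, 2, 2, 3, 3)
ari(cl, cl)                   # 1: identical partitions
ari(cl, c(3, 3, 1, 1, 2, 2))  # 1: same partition, clusters merely relabelled
```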

5.4 Kernel K-means

In k-means and PAM we assume that the data is linearly separable, which may be correct for some datasets, but when the data is non-linearly separable these algorithms run into problems. To solve this we have Kernel K-means, which can handle non-linear clustering.

For the kernel function we will use the Gaussian (RBF), since it is one of the most common and usually yields good results.
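As a quick sketch of what this kernel does, we can write the formula by hand (kernlab's rbfdot uses \(k(x, y) = \exp(-\sigma \|x - y\|^2)\)): nearby points get similarity close to 1 and distant points close to 0, which is what lets the algorithm bend cluster boundaries:

```r
# Gaussian (RBF) kernel written out by hand:
# k(x, y) = exp(-sigma * ||x - y||^2), the formula behind kernlab's rbfdot.
rbf <- function(x, y, sigma = 0.1) exp(-sigma * sum((x - y)^2))

x <- c(0, 0)
rbf(x, x)            # 1: a point is maximally similar to itself
rbf(x, c(0.5, 0.5))  # near 1: nearby points
rbf(x, c(10, 10))    # near 0: distant points
```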

Since we obtained 5 as the number of clusters for k-means and PAM, and we want to compare the kernel-based results with the ones obtained previously, we will set the number of centers to 5.

# Kernel K-means with 5 clusters and Gaussian kernel
kernel_kmeans <- kkmeans(as.matrix(scaled_data), centers=5, kernel="rbfdot") 
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# Printing insights
print(centers(kernel_kmeans))
##             [,1]         [,2]        [,3]       [,4]       [,5]       [,6]
## [1,] -0.26219435  0.258535459 -1.22944699 -0.6484362 -0.6470438  0.1447675
## [2,]  0.09249999 -0.083716359  0.88673931  1.0028133  0.9979552  0.5527808
## [3,]  0.05364463 -0.050611178  0.60791754  0.1953056  0.1768157 -0.1387786
## [4,] -0.01979891  0.008712642 -0.06907604  0.3053301  0.3132668  0.3108050
## [5,]  0.44729649 -0.445295088  0.15488516 -1.6022269 -1.5686088 -2.0623640
##            [,7]        [,8]       [,9]      [,10]       [,11]      [,12]
## [1,] -0.6780496 -0.65043076 -0.4982323 -0.7007035 -0.67038604 -0.5763795
## [2,]  0.9873143  1.09722353  1.2114215  0.2837527  0.14893955  0.7573244
## [3,]  0.3436136  0.09175922 -0.2701539  1.1465063  1.17876803  0.2639703
## [4,]  0.2275048  0.30116690  0.1370792  0.1000764 -0.03712026  0.2535976
## [5,] -1.6567517 -1.60265699 -1.2559269 -1.3274700 -0.90062675 -1.2509700
##             [,13]      [,14]      [,15]      [,16]      [,17]        [,18]
## [1,] -0.233459495 -0.6533672 -0.6495869 -0.3533637 -0.7339095  0.085149223
## [2,]  0.314478275  0.7028493  0.4001398 -0.2746526  1.0668943  0.554139619
## [3,] -0.008387275  0.6273914  0.9229338  1.1778734  0.3221264 -0.002137926
## [4,]  0.119765107  0.1944342  0.1217089 -0.2212353  0.2440710  0.239947246
## [5,] -0.308789508 -1.5668609 -1.3248795 -0.4249354 -1.6778832 -2.059710693
##           [,19]      [,20]      [,21]       [,22]       [,23]       [,24]
## [1,] -0.7357046 -0.7527324 -0.7118864 -0.67117167 -0.58734144 -0.62489320
## [2,]  1.1600122  1.1370626  1.2048709  1.26732266  0.03764754  0.09706272
## [3,]  0.2215232  0.2961622  0.1188909 -0.04738864  1.09502814  1.20032993
## [4,]  0.1785429  0.1809495  0.1706865  0.17181852 -0.01945125  0.02204187
## [5,] -1.5717235 -1.6298039 -1.5181491 -1.43681240 -0.71408034 -1.05506325
##            [,25]       [,26]      [,27]       [,28]      [,29]      [,30]
## [1,] -0.66562418 -0.64492695 -0.7683665  0.05893207 -0.7641713 -0.7527570
## [2,]  0.22126972  0.18340386  1.0994707  0.39820053  1.0388175  0.9434299
## [3,]  1.11037309  1.13546907  0.3961694  0.14491777  0.4499908  0.5267102
## [4,]  0.06103788  0.08042132  0.1713226  0.28293344  0.1435085  0.1404587
## [5,] -1.12476074 -1.17865055 -1.6881845 -1.98696616 -1.6142517 -1.5679963
##           [,31]       [,32]      [,33]       [,34]      [,35]      [,36]
## [1,] -0.7376390 -0.51801078 -0.5329695 -0.58873877 -0.5461626  0.2461042
## [2,]  0.9937672 -0.02281665 -0.1971525  0.01583346 -0.1358853  0.2147428
## [3,]  0.4224049  1.19153265  1.3470677  1.34745786  1.3675118  0.2167385
## [4,]  0.1353621  0.05144254 -0.2133435 -0.09810332 -0.1381050  0.3145564
## [5,] -1.5043250 -1.09817535 -0.4771870 -1.05677930 -0.7714875 -2.2805449
##           [,37]      [,38]      [,39]      [,40]      [,41]      [,42]
## [1,] -0.5260427 -0.5260590 -0.2580452 -0.2580422 -0.2568152 -0.2452073
## [2,] -0.1364186 -0.1231060 -0.2580452 -0.2580422 -0.2568152 -0.2452073
## [3,]  1.4015614  1.3834960 -0.2580452 -0.2580422 -0.2568152 -0.2452073
## [4,] -0.1938460 -0.1737754 -0.2580452 -0.2580422 -0.2568152 -0.2452073
## [5,] -0.8014034 -0.8319001  2.4107213  2.4106941  2.3992306  2.2907874
##          [,43]      [,44]      [,45]       [,46]       [,47]       [,48]
## [1,] -0.252049 -0.2522743 -0.2419094 -1.10103366 -1.22399694 -0.27915663
## [2,] -0.252049 -0.2522743 -0.2419094  0.67064447  0.86118801  0.17415742
## [3,] -0.252049 -0.2522743 -0.2419094  0.74039238  0.62396252  0.29536207
## [4,] -0.252049 -0.2522743 -0.2419094  0.04445238 -0.05017494 -0.12546036
## [5,]  2.354704  2.3568083  2.2599772 -0.16457828  0.13461071 -0.03776904
##           [,49]       [,50]      [,51]       [,52]       [,53]         [,54]
## [1,] -1.1373870 -1.06903821 -1.1709321 -1.12156924 -0.20136721 -2.769117e-01
## [2,]  0.7826458  0.77106287  0.8268856  0.82496217  0.01470679  1.171347e-02
## [3,]  0.6831753  0.43959519  0.6735229  0.47800137  0.17540494  3.619791e-01
## [4,] -0.1672671  0.01725029 -0.1511502 -0.01922966  0.04111457 -3.490203e-03
## [5,]  0.1637080  0.18705284  0.1418292  0.18711912  0.09481072 -8.321274e-05
##            [,55]       [,56]        [,57]       [,58]       [,59]      [,60]
## [1,] -0.21740980 -0.02120337  0.008261062  0.01079948  0.11043522 -0.2716794
## [2,] -0.02370872  0.07830815  0.067354193  0.79024332 -0.56884396 -0.2716794
## [3,]  0.27587705 -0.23496128 -0.204719837 -0.68717134  0.90074590 -0.2716794
## [4,]  0.04847219 -0.06621671 -0.108375916  0.13453318 -0.47885432 -0.2716794
## [5,]  0.00815589  0.48944418  0.442356608 -0.77345809  0.02817498  2.5380955
##            [,61]
## [1,]  0.03481609
## [2,] -0.15309372
## [3,]  0.03322942
## [4,]  0.45422887
## [5,] -0.60516990
print(size(kernel_kmeans))
## [1] 541 468 402 336 187
print(withinss(kernel_kmeans))
## [1] 21949.865 27229.798 27018.489  9437.864 53150.698
# Plotting clustering
object.ker = list(data = scaled_data, cluster = kernel_kmeans@.Data)
fviz_cluster(object.ker, geom = c("point"), ellipse = FALSE, pointsize = 1) +
  theme_minimal() +
  geom_text(label = names, hjust = 0, vjust = 0, size = 2, check_overlap = TRUE) +
  scale_fill_brewer(palette = "Paired")

Again, looking at the plot, the clustering seems very similar; to check, we will compare it with the k-means and PAM results.

# Computing similarity between all clustering methods
adjustedRandIndex(final_kmeans$cluster, kernel_kmeans@.Data)
## [1] 0.5533428
adjustedRandIndex(final_pam$cluster, kernel_kmeans@.Data)
## [1] 0.5798562

Just as we predicted, the similarity between k-means, PAM and Kernel K-means is significant. From this we can say that the data is likely linearly separable, since the kernel-based result does not change much, and also that the clustering we have performed is again very stable.

5.5 Hierarchical clustering

5.5.1 Motivation

Hierarchical clustering organizes data into a tree-like structure, and there are different types of it. Following the same reasoning as before for PCA, the variables have to be scaled, since the method is based on distances and different scales could heavily distort the results.

These are: Agglomerative and Divisive Hierarchical Clustering.

On the one hand, agglomerative clustering initially considers each point a separate cluster; it then computes the distance matrix for the data and joins the two closest points. It repeats this process, now taking the clusters into account, until a single cluster containing all the points is created.

On the other hand, divisive clustering is the opposite of the previous one. It starts with all data points together in one single cluster and then divides it into smaller clusters, until each point forms its own cluster. It does this by finding the two most dissimilar elements in a cluster and separating them. Computationally this is a bit more complex, as it is harder to find the optimal splits than the two closest points.
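A small illustrative comparison on synthetic data (divisive clustering via the diana function from the cluster package, agglomerative via hclust): on clearly separated groups, cutting both trees at \(k = 2\) recovers the same partition:

```r
library(cluster)  # for diana()

set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # two separated groups
             matrix(rnorm(20, mean = 5), ncol = 2))

agg <- hclust(dist(toy), method = "average")  # agglomerative: bottom-up merges
div <- diana(toy)                             # divisive: top-down splits

cut_agg <- cutree(agg, k = 2)
cut_div <- cutree(as.hclust(div), k = 2)
table(cut_agg, cut_div)  # each cluster of one cut maps onto one cluster of the other
```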

First of all, we will compute the distances as well as the linkages. As there is no single rule for which distance or linkage method is better, we will plot the dendrograms using different combinations of a few of them, explained below, and analyze the results to see which method is best.

For computing the distance matrix there are many different methods such as Euclidean, given by this formula:

\[ d(i, j) = \sqrt{\sum_{k=1}^{n} \left( x_{ik} - x_{jk} \right)^2} \]

We also have Canberra distance given by:

\[ d(i, j) = \sum_{k=1}^{n} \frac{\left| x_{ik} - x_{jk} \right|}{\left| x_{ik} \right| + \left| x_{jk} \right|} \]

There are many more methods, such as Manhattan or Minkowski, but for this project we just want to try a few and see which one works best.
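A quick toy check, with hand-picked vectors, that stats::dist matches the two formulas above:

```r
x <- c(1, 2, 3)
y <- c(4, 0, 3)
m <- rbind(x, y)

as.numeric(dist(m, method = "euclidean"))  # sqrt(3^2 + 2^2 + 0^2) = sqrt(13)
sqrt(sum((x - y)^2))                       # same value, from the formula

as.numeric(dist(m, method = "canberra"))   # 3/5 + 2/2 + 0/6 = 1.6
sum(abs(x - y) / (abs(x) + abs(y)))        # same value, from the formula
```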

There are 4 main linkage methods, which we will test to see which one produces the most interesting result. Linkage methods measure the distance between two clusters, taking into account the points inside them.

We have single linkage, which measures the shortest distance:

\[ d(C_1, C_2) = \min \left\{ d(x, y) \mid x \in C_1, y \in C_2 \right\} \]

We have average linkage, which measures the average distance:

\[ d(C_1, C_2) = \frac{1}{|C_1| |C_2|} \sum_{x \in C_1, y \in C_2} d(x, y) \]

We also have complete linkage, which measures the furthest distance:

\[ d(C_1, C_2) = \max \left\{ d(x, y) \mid x \in C_1, y \in C_2 \right\} \]

And finally we have Ward’s method which focuses on minimizing variance:

\[ d(C_1, C_2) = \sqrt{\frac{|C_1| |C_2|}{|C_1| + |C_2|} \left( \mu_{C_1} - \mu_{C_2} \right)^T \left( \mu_{C_1} - \mu_{C_2} \right)} \]
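One small sanity check on synthetic data: while every point is still a singleton, all four linkage rules coincide, so the first merge height of each hclust tree equals the smallest pairwise distance:

```r
set.seed(1)
toy <- matrix(rnorm(30), ncol = 3)  # 10 points in 3 dimensions
d   <- dist(toy)

methods <- c("single", "complete", "average", "ward.D")
first_heights <- sapply(methods, function(m) hclust(d, method = m)$height[1])
first_heights  # identical across linkages...
min(d)         # ...and equal to the smallest pairwise distance
```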

This is how we computed the different distances and fitted the different clusterings.

d1 = dist(scaled_data, method = "euclidean")
d2 = dist(scaled_data, method = "canberra")



hc_euclidean_complete = hclust(d1, method = "complete")
hc_euclidean_single = hclust(d1, method = "single")
hc_euclidean_average = hclust(d1, method = "average")
hc_euclidean_ward = hclust(d1, method = "ward.D")

hc_canberra_complete = hclust(d2, method = "complete")
hc_canberra_single = hclust(d2, method = "single")
hc_canberra_average = hclust(d2, method = "average")
hc_canberra_ward = hclust(d2, method = "ward.D")

To see which method is best, we will obtain the silhouette values for each of the hierarchical clusterings computed above. The one that yields a mean value closest to one will be the best, as this indicates that points are closer to their own cluster than to the nearest other cluster. If a value is negative, it likely indicates a misclassification.

Based on our previous clustering analysis, we will use the cutree function to cut the trees at 5 clusters, which, as seen with the previous methods, appears to be the best number of clusters for our data.

# Establish the number of clusters where to cut the dendrogram
chosen_k = 5

# Helper: mean silhouette width for a tree cut into chosen_k clusters
mean_sil <- function(hc, d, k = chosen_k) {
  mean(silhouette(cutree(hc, k = k), d)[, 3])
}

# Mean silhouette scores for every distance / linkage combination
mean_silhouettes <- c(
  "Euclidean Complete" = mean_sil(hc_euclidean_complete, d1),
  "Euclidean Single"   = mean_sil(hc_euclidean_single,   d1),
  "Euclidean Average"  = mean_sil(hc_euclidean_average,  d1),
  "Euclidean Ward"     = mean_sil(hc_euclidean_ward,     d1),
  "Canberra Complete"  = mean_sil(hc_canberra_complete,  d2),
  "Canberra Single"    = mean_sil(hc_canberra_single,    d2),
  "Canberra Average"   = mean_sil(hc_canberra_average,   d2),
  "Canberra Ward"      = mean_sil(hc_canberra_ward,      d2)
)

best_method <- names(mean_silhouettes)[which.max(mean_silhouettes)]
best_value <- max(mean_silhouettes)

print(paste("Best method is:", best_method, "with a value of ", best_value))
## [1] "Best method is: Euclidean Single with a value of  0.307839077323307"

Of all the combinations tested, the best result was obtained with Euclidean distance and single linkage.

d1 = dist(scaled_data, method = "euclidean")

hc_euclidean_single = hclust(d1, method = "single")

As these plots are computationally expensive and RStudio takes a long time to render them, we plot them only for a reduced subset: the players with the highest overall rating (Overall > 80).

reduced_data <- full_data[full_data$Overall > 80,]
reduced_numeric_data <- select_if(reduced_data, is.numeric)
scaled_reduced_data <- scale(sqrt(reduced_numeric_data))
d1_reduced = dist(scaled_reduced_data, method = "euclidean")

hc_euclidean_single_reduced = hclust(d1_reduced, method = "single")

fviz_dend(
  x = hc_euclidean_single_reduced,
  k = 5,
  palette = "jco", 
  rect = TRUE, 
  rect_fill = TRUE, 
  rect_border = "jco"
) + 
  guides(color = "none", fill = "none")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Since this dendrogram is hard to read, we will also visualize the clustering as a phylogenetic tree.

fviz_dend(x = hc_euclidean_single_reduced,
          k = 5,
          color_labels_by_k = TRUE,
          cex = 0.8,
          type = "phylogenic",
          repel = TRUE) +
  labs(title = "Top 5 European Football Leagues Tree Structure") +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())

We can see that the clustering here is not good: the subset of players we selected still mixes goalkeepers with outfield players of many different positions, so the groups do not correspond to clean clusters. Moreover, these graphs are computationally expensive and hard to plot and visualize, since the distances between observations must be repeatedly compared to assign the clusters.

5.6 EM CLUSTERING

EM clustering assumes each data point is generated from a mixture of probability distributions, in our case Gaussians. The algorithm alternates between computing the probability of each point belonging to each cluster given the current parameter estimates (E-step) and updating the parameters, such as the means and covariances, to maximize the likelihood of the data (M-step).
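Formally, the model maximizes the Gaussian mixture log-likelihood

$$\log L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),$$

where the E-step computes the responsibilities

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},$$

which the M-step then uses to re-estimate the mixing proportions $\pi_k$, means $\mu_k$ and covariances $\Sigma_k$.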

set.seed(1)
res.Mclust <- Mclust(scaled_data)
summary(res.Mclust)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust XXX (ellipsoidal multivariate normal) model with 1 component: 
## 
##  log-likelihood    n   df      BIC      ICL
##         38043.7 1934 1952 61315.95 61315.95
## 
## Clustering table:
##    1 
## 1934

The result obtained here is that all points belong to a single cluster. In a different run we instead obtained 9 as the best number of clusters; those results are discussed below. The variation arises because the procedure involves random initialization when searching for clusters, so different runs can give different results.

Since some of the code below would return an error with only one cluster, we left that code commented out.

In the 9-cluster run, the log-likelihood was -12545.42. The most important quantities to examine are BIC, which evaluates the fit of the model while penalizing complexity, and ICL, which additionally accounts for the quality of the cluster assignments. In that run, most players fell into clusters 1, 5 and 6.
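Note that mclust reports BIC on the scale where larger values are better:

$$\mathrm{BIC} = 2 \log L - \mathrm{df} \cdot \log n.$$

For the one-component fit above, for example, $2 \times 38043.7 - 1952 \times \log(1934) \approx 61316$, matching the reported value.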

# We also checked a 5-cluster model, but the results obtained were worse than with 9 clusters (which gave BIC = -48926.59 and ICL = -49119.96)
res.Mclust5 <- Mclust(scaled_data, G = 5)
summary(res.Mclust5)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VII (spherical, varying volume) model with 5 components: 
## 
##  log-likelihood    n  df       BIC       ICL
##       -119463.4 1934 314 -241302.9 -241362.5
## 
## Clustering table:
##   1   2   3   4   5 
## 606 133 365 316 514
fviz_mclust(object = res.Mclust, what = "BIC", palette = "jco") +
  scale_x_discrete(limits = c(1:10))
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?

Looking at this graph, we can see that EM clustering does not return a very good result: there are many overlaps as well as weakly separated clusters. It does separate the goalkeepers, which is something the previous methods had not achieved, and it may even distinguish good from bad goalkeepers. For the outfield players, however, the clustering is poor, as the separation between clusters is not at all clear.

5.7 Heatmaps

Heatmaps are a fantastic way to see relationships in the data. They use dendrograms, the structures already seen in hierarchical clustering.

This will be done with Euclidean distance and single linkage, as we have seen this is the best method.

To plot the heatmap, we will use the pheatmap library, which, with the number of observations we currently have, produces a more legible visualization.

pheatmap(scaled_data,
         clustering_distance_rows = "euclidean",
         clustering_method = "single",
         fontsize_row = 8,   
         fontsize_col = 8,   
         angle_col = 90,      
         treeheight_row = 50, 
         treeheight_col = 50) 

This graph is not very useful for visualizing similarities between players: distances are computed for around 2000 players, which is very hard to read. However, it does show the similarities between variables, which makes the graph useful for our project. The shorter the branch connecting two variables, the more closely they are related. Looking at the graph, these relations make sense: Shots On Target Against and Saves are strongly related, and Attempted Passes and Touches are also very similar. These are just two examples, but both combinations make a lot of sense.
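Since the per-player heatmap is unreadable at this scale, a more legible alternative (a sketch, assuming the `scaled_data` matrix from the earlier sections) is to cluster the variables directly through their correlation matrix:

```r
library(pheatmap)

# Correlations between the numeric variables (columns),
# far fewer rows/columns than the ~2000 players
var_cor <- cor(scaled_data)

# Same distance and linkage as before, but on variables only
pheatmap(var_cor,
         clustering_distance_rows = "euclidean",
         clustering_method = "single",
         fontsize_row = 8,
         fontsize_col = 8,
         angle_col = 90)
```

This keeps the variable-level insights (e.g. Saves clustering with Shots On Target Against) while dropping the unreadable player-by-player rows.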

5.8 Conclusion

After trying several types of clustering, we observed that the methods yielding the best results were plain K-means and K-medoids. This emphasizes that the more basic methods can work very well, as long as they are backed by a good dataset and careful preprocessing.

6. CONCLUSION

After analyzing our dataset with the different unsupervised learning techniques, we saw the benefits each provided: PCA gave a huge reduction in dimensionality, FA revealed meaningful hidden information, and clustering was useful for grouping similar players. All this allowed us to identify players such as Bruno Fernandes or David Raya, who had fantastic seasons with their respective clubs, to visualize the differences between great and average players more clearly, to characterize the role of each player on the field, and much more.

Finally, we wanted to focus this project not only on the coding itself but also on the meaning behind every algorithm and every formula we used for each decision. That is why we often provide mathematical explanations along with commentary on our choices, since we believe this is the way to fully understand both what we are doing and the subject itself.